
VMware Failure Case Study: ESXi 6.0 Disk Congestion

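Symptoms:

At around 3 a.m. on July 5, an ESXi 6.0 host reported SSD congestion on one SSD. This SSD card had only just been installed as a replacement. After the SSD card was unmounted, the SSD congestion disappeared.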

Analysis:

Product version information:

Huawei Technologies Co., Ltd. RH2288H V2-24S | BIOS: RMIBV503 | Date (ISO-8601): 2015-03-09
VMware ESXi 6.0.0 build-6921384
ESXi 6.0 Patch 6 ESXi600-201711001 11/9/2017 6921384 N/A

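Checking the driver information for the HBAs used by vSAN shows that vmhba3 and vmhba4, which connect the two SSDs, are not on the vSAN compatibility list (HCL).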
Support Bundle: .(ESXi 6.0 U3) Virtual SAN Enabled: Yes
HBA: vmhba4

Huawei <class> Mass storage controller(19e5:0007 19e5:0007) Status:  Not Listed on HCL
hio 2.1.0.23 Status:  Not checked

HBA: vmhba2

LSI Logic / Symbios Logic LSI2308_2(1000:0087 1000:0087) Status:  Found Match on HCL
mpt2sas 19.00.00.00.1vmw Status:  Driver/Version As per HCL
Recommended Drivers for version ESXi 6.0 U3:
    Driver: mpt2sas Ver:19.00.00.00.1vmw (Match Confidence: 100) Firmware: 19.00.00.00-IT
VCG link: http://vcg-stg-vip-1.vmware.com/comp_guide2/detail.php?deviceCategory=vsanio&productid=39286

HBA: vmhba3

Huawei <class> Mass storage controller(19e5:0007 19e5:0007) Status:  Not Listed on HCL
hio 2.1.0.23 Status:  Not checked

vmhba2 mpt2sas 19.00.00.00.1vmw 1000 0087 1000 0087 LSI Logic / Symbios Logic LSI2308_2
vmhba3 hio 2.1.0.23 19e5 0007 19e5 0007
vmhba4 hio 2.1.0.23 19e5 0007 19e5 0007

Disk Group: 5275c7e2-f296-a38e-9b0d-15fe4aea962c
Device Type In CMMDS Vendor Model Revision Offline? Size Transport HBA
t10.hioa___00030PXS10D6000058 SSD false Huawei ES3000 2.0 false 1121.81GB parallel vmhba4
naa.5000cca0720a8210 MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000c500c18343d7 MD false SEAGATE ST1200MM0009 N003 false 1117.81GB sas vmhba2
naa.5000cca0720a5a60 MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca0720a15c8 MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000c500c15a6cfb MD false SEAGATE ST1200MM0009 N003 false 1117.81GB sas vmhba2
naa.5000cca07209c0cc MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000c500c1a59b2f MD false SEAGATE ST1200MM0009 N003 false 1117.81GB sas vmhba2

Disk Group: 52d8a147-5bf1-2fa3-f755-ffc14a44ab8f
Device Type In CMMDS Vendor Model Revision Offline? Size Transport HBA
naa.5000cca0720a63d4 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca0720a1c3c MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca07209d7a8 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
t10.hiob___00030PXT10F3000198 SSD true Huawei ES3000 2.0 false 747.88GB parallel vmhba3
naa.5000cca0720a5cb4 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca07209ab74 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca0720a5a4c MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
naa.5000cca0720a7bf4 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2

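The vSAN disk information shows that every device in disk group 5275c7e2-f296-a38e-9b0d-15fe4aea962c has In CMMDS set to false, which suggests that this disk group is not mounted in vSAN.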
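The same check can be automated when there are many hosts or disk groups to review. The following is a minimal sketch, assuming the listing is whitespace-separated with exactly ten columns per device row, as in the output above. The function names `cmmds_status` and `unmounted_groups` are hypothetical helpers, not part of any VMware tool, and the sample is shortened to two devices per group.

```python
from collections import defaultdict

# Sample rows copied from the listing above (two devices per group for brevity).
LISTING = """\
Disk Group: 5275c7e2-f296-a38e-9b0d-15fe4aea962c
Device Type In CMMDS Vendor Model Revision Offline? Size Transport HBA
t10.hioa___00030PXS10D6000058 SSD false Huawei ES3000 2.0 false 1121.81GB parallel vmhba4
naa.5000cca0720a8210 MD false HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2

Disk Group: 52d8a147-5bf1-2fa3-f755-ffc14a44ab8f
Device Type In CMMDS Vendor Model Revision Offline? Size Transport HBA
naa.5000cca0720a63d4 MD true HGST HUC101212CSS600 A440 false 1117.81GB sas vmhba2
t10.hiob___00030PXT10F3000198 SSD true Huawei ES3000 2.0 false 747.88GB parallel vmhba3
"""


def cmmds_status(listing):
    """Return {disk_group_uuid: [(device, type, in_cmmds), ...]}."""
    groups = defaultdict(list)
    current = None
    for line in listing.splitlines():
        fields = line.split()
        if line.startswith("Disk Group:"):
            current = fields[2]
        # Assumption: device rows have 10 fields and column 3 is the In CMMDS flag.
        # The header row splits into 11 fields, so it is skipped automatically.
        elif current and len(fields) == 10 and fields[2] in ("true", "false"):
            groups[current].append((fields[0], fields[1], fields[2] == "true"))
    return dict(groups)


def unmounted_groups(listing):
    """Disk groups in which no device is published in CMMDS."""
    return [
        uuid
        for uuid, devices in cmmds_status(listing).items()
        if devices and not any(in_cmmds for _, _, in_cmmds in devices)
    ]


if __name__ == "__main__":
    for uuid in unmounted_groups(LISTING):
        print(uuid, "has no devices in CMMDS")
```

A row whose vendor or model name contains a space would break the ten-field assumption and be ignored, so the parser should be checked against real output before it is relied on.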

2020-07-02T09:04:01Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca0720a5fac -s t10.hioa___00030PXS10D6000058
2020-07-02T09:05:56Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca0720a5fac -s t10.hioa___00030PXS10D6000058
2020-07-02T09:07:25Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000c500c15a6cfb -s t10.hioa___00030PXS10D6000058
2020-07-02T09:07:43Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca07209c0cc -s t10.hioa___00030PXS10D6000058
2020-07-02T09:07:55Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca0720a5a60 -s t10.hioa___00030PXS10D6000058
2020-07-02T09:08:05Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000c500c1a59b2f -s t10.hioa___00030PXS10D6000058
2020-07-02T09:08:13Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000cca0720a15c8 -s t10.hioa___00030PXS10D6000058
2020-07-02T09:08:25Z shell[38762]: [root]: esxcli vsan storage add -d naa.5000c500c18343d7 -s t10.hioa___00030PXS10D6000058
2020-07-05T00:40:49Z shell[37182]: [root]: esxcli vsan storage diskgroup unmount -d t10.hioa___00030PXS10D6000058
2020-07-05T00:41:08Z shell[37182]: [root]: esxcli vsan storage diskgroup unmount -s t10.hioa___00030PXS10D6000058

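The vobd and vmkernel logs show that this disk group suddenly began reporting congestion at the time shown below, and that neither the driver nor the disks reported any errors before the congestion started.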
vobd.log
2020-07-04T01:56:52.204Z: [VsanCorrelator] 60119767788us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:56:52.204Z: [VsanCorrelator] 60120438700us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:57:52.204Z: [VsanCorrelator] 60179767871us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T01:57:52.204Z: [VsanCorrelator] 60180439130us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T01:58:52.224Z: [VsanCorrelator] 60239787321us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:58:52.224Z: [VsanCorrelator] 60240459260us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:59:52.226Z: [VsanCorrelator] 60299788319us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T01:59:52.226Z: [VsanCorrelator] 60300460977us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T02:00:52.233Z: [VsanCorrelator] 60359795073us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T02:00:52.233Z: [VsanCorrelator] 60360468427us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T02:01:52.235Z: [VsanCorrelator] 60419795887us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
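To see how often and for how long the threshold was exceeded, the congestion events can be extracted from vobd.log and paired up. The following is a minimal sketch, assuming each event sits on a single unwrapped line in the format shown in this log. `parse_events` and `exceeded_intervals` are hypothetical helper names. Only the `vob.` entries are counted, because every event is logged a second time under `esx.problem.`.

```python
import re
from datetime import datetime

# Matches only the vob.* entry; the esx.problem.* entry repeats the same event.
EVENT = re.compile(
    r"^(?P<ts>\S+?Z): \[VsanCorrelator\] \d+us: "
    r"\[vob\.vsan\.lsom\.congestionthreshold\] LSOM SSD (?P<uuid>[0-9a-f-]+) "
    r"Congestion State: (?P<state>\w+)\. "
    r"Congestion Threshold: (?P<threshold>\d+) Current Congestion: (?P<current>\d+)\."
)

# Four entries taken from the vobd.log excerpt in this article.
SAMPLE = """\
2020-07-04T01:56:52.204Z: [VsanCorrelator] 60119767788us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:56:52.204Z: [VsanCorrelator] 60120438700us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
2020-07-04T01:57:52.204Z: [VsanCorrelator] 60179767871us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 148.
2020-07-04T01:58:52.224Z: [VsanCorrelator] 60239787321us: [vob.vsan.lsom.congestionthreshold] LSOM SSD 5275c7e2-f296-a38e-9b0d-15fe4aea962c Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 202.
"""


def parse_events(text):
    """Return [(timestamp, disk_group_uuid, state, current_congestion), ...]."""
    events = []
    for line in text.splitlines():
        m = EVENT.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
            events.append((ts, m["uuid"], m["state"], int(m["current"])))
    return events


def exceeded_intervals(events):
    """Pair each Exceeded event with the next Normal event of the same disk group."""
    open_since, intervals = {}, []
    for ts, uuid, state, _ in events:
        if state == "Exceeded":
            open_since.setdefault(uuid, ts)
        elif state == "Normal" and uuid in open_since:
            intervals.append((uuid, open_since.pop(uuid), ts))
    return intervals


if __name__ == "__main__":
    for uuid, start, end in exceeded_intervals(parse_events(SAMPLE)):
        print(uuid, start.isoformat(), (end - start).total_seconds(), "seconds")
```

An Exceeded event with no later Normal event, such as the last sample line, is left unpaired and does not appear in the result.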

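Conclusion:

Although the alert always reports congestion against the disk group's SSD, the cause can be a performance problem anywhere in the I/O path: the HBA, the SSD, or the MDs (the capacity disks).
Investigating the cause of congestion is very complex. Because the logs contain no errors from the driver, the SSD, or the MDs, the specific cause cannot be determined from the logs alone.
For this host, try replacing the HBA driver and firmware with compatible versions, then mount the disk group again and monitor it. If the problem persists, consider upgrading ESXi to 6.5 or later and updating to compatible driver and firmware versions at the same time.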
