I'm testing VSAN 6.2 (on-disk format v3) on a 3-node cluster (call the nodes A, B and C) in my lab. I decided to simulate a power loss on all nodes, but with some delay between them.
I had 8 VMs running a disk benchmark at the time of the power loss.
First I shut down node C. I waited 5 minutes and then powered off A and B, 2 seconds apart.
I waited a couple of minutes and then started powering the hosts back on, but in reverse order.
I deliberately powered on C first, knowing the data on it would be outdated. I wanted to test how the cluster recovers.
After 2 minutes I turned on the two remaining hosts.
During boot, ESXi spent several minutes initializing the VSAN disks. Even though C started first, there was a period when all three hosts were initializing VSAN at the same time.
I thought that would be enough for the system to resynchronize, but I was wrong!
Once all 3 hosts were online, 3 of the 8 VMs were in the inaccessible state, 3 other VMs were accessible but out of sync, and 2 were healthy.
The cluster is stuck rebuilding one object. The object is only 1 GB, but after 10 hours it is still in the same state:
/localhost/DC74/computers/CLSTR01> vsan.resync_dashboard ~cluster
2016-04-09 05:41:16 -0500: Querying all VMs on VSAN ...
2016-04-09 05:41:16 -0500: Querying all objects in the system from b1200. ...
2016-04-09 05:41:17 -0500: Got all the info, computing table ...
+-----------------------------------------------------------------------+-----------------+---------------+
| VM/Object | Syncing objects | Bytes to sync |
+-----------------------------------------------------------------------+-----------------+---------------+
| A_temp_moving | 1 | |
| [vsan_ssd1] 3574e73e-d79b-d092-8bfb-00266cf2880c/A_temp_moving.vmx | | 1.00 GB |
+-----------------------------------------------------------------------+-----------------+---------------+
| Total | 1 | 1.00 GB |
+-----------------------------------------------------------------------+-----------------+---------------+
The 3 VSAN objects corresponding to the 3 inaccessible VMs are also marked inaccessible:
/localhost/DC74/computers/CLSTR01> vsan.check_state ~cluster
2016-04-09 05:45:42 -0500: Step 1: Check for inaccessible VSAN objects
Detected 3 objects to be inaccessible
Detected cef8e73e-955e-f306-1078-00266cf2880c on b1200. to be inaccessible
Detected 31f8e73e-ab51-c751-5bb5-00266cf2880c on b1200. to be inaccessible
Detected e8f8e73e-7056-0d95-1f51-00266cf2880c on b9100. to be inaccessible
2016-04-09 05:45:42 -0500: Step 2: Check for invalid/inaccessible VMs
Detected VM 'A_temp_moving8' as being 'inaccessible'
Detected VM 'A_temp_moving6' as being 'inaccessible'
2016-04-09 05:45:42 -0500: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Found VMs for which VC/hostd/vmx are out of sync:
A_temp_moving9
A_temp_moving7
A_temp_moving4
A_temp_movin3
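For anyone who wants to track this output across repeated runs, here is a minimal Python sketch that pulls the flagged object UUIDs and VM names out of the pasted `vsan.check_state` text. The function name `summarize_check_state` and the line patterns are my own assumptions based only on the output shown above; this is not an RVC or VMware API.

```python
import re

# Text copied from the vsan.check_state output in this post.
SAMPLE = """\
2016-04-09 05:45:42 -0500: Step 1: Check for inaccessible VSAN objects
Detected 3 objects to be inaccessible
Detected cef8e73e-955e-f306-1078-00266cf2880c on b1200. to be inaccessible
Detected 31f8e73e-ab51-c751-5bb5-00266cf2880c on b1200. to be inaccessible
Detected e8f8e73e-7056-0d95-1f51-00266cf2880c on b9100. to be inaccessible
2016-04-09 05:45:42 -0500: Step 2: Check for invalid/inaccessible VMs
Detected VM 'A_temp_moving8' as being 'inaccessible'
Detected VM 'A_temp_moving6' as being 'inaccessible'
2016-04-09 05:45:42 -0500: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Found VMs for which VC/hostd/vmx are out of sync:
A_temp_moving9
A_temp_moving7
A_temp_moving4
A_temp_movin3
"""

# Hypothetical patterns, inferred from the lines above.
OBJECT_RE = re.compile(r"Detected ([0-9a-f-]{36}) on (\S+) to be inaccessible")
VM_RE = re.compile(r"Detected VM '(.+)' as being 'inaccessible'")
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} ")


def summarize_check_state(text):
    """Group flagged objects and VMs from pasted check_state output."""
    objects, inaccessible_vms, out_of_sync = [], [], []
    in_sync_list = False
    for raw in text.splitlines():
        line = raw.strip()
        m = OBJECT_RE.match(line)
        if m:
            objects.append((m.group(1), m.group(2)))  # (uuid, host)
            continue
        m = VM_RE.match(line)
        if m:
            inaccessible_vms.append(m.group(1))
            continue
        if line.startswith("Found VMs for which"):
            in_sync_list = True  # the following lines are VM names
            continue
        if in_sync_list and line:
            if TIMESTAMP_RE.match(line):
                in_sync_list = False  # a new step header ends the list
            else:
                out_of_sync.append(line)
    return {
        "objects": objects,
        "inaccessible_vms": inaccessible_vms,
        "out_of_sync_vms": out_of_sync,
    }


if __name__ == "__main__":
    for key, values in summarize_check_state(SAMPLE).items():
        print(key, len(values), values)
```

Saving one summary per run makes it easy to see whether the set of inaccessible objects changes over time or stays frozen.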
Examining the inaccessible objects further shows that some of them have all components ACTIVE but marked as STALE! Others have 2 components ACTIVE and one missing, with all of them marked STALE.
<LSTR01> vsan.object_info ~cluster 31f8e73e-ab51-c751-5bb5-00266cf2880c
DOM Object: 31f8e73e-ab51-c751-5bb5-00266cf2880c (v3, owner: b1200., policy: No POLICY entry found in CMMDS)
RAID_1
Component: 31f8e73e-738f-3952-44b2-00266cf2880c (state: ACTIVE (5), csn: STALE (owner stale), host: b4300., md: 5288d2ea-7203-770d-5875-a5a721d925bc, ssd: 52f5b565-c20e-17d9-6b1f-6ebd6c50ae23,
votes: 1, usage: 0.4 GB)
Component: 31f8e73e-f1f7-3b52-08a5-00266cf2880c (state: ACTIVE (5), csn: STALE (owner stale), host: b1200., md: 527993ca-0b3f-d92f-a78b-a94ceccec98d, ssd: 52a32e68-80fc-285d-23f1-1758e80d63a5,
votes: 1, usage: 0.4 GB)
Witness: 31f8e73e-53f0-3d52-c1cc-00266cf2880c (state: ACTIVE (5), host: b9100., md: 52d67be3-f1a1-c6df-c4fa-60100694133c, ssd: 528b8cac-c51d-f0fb-f5fe-7c4d1fd1220d,
votes: 1, usage: 0.0 GB)
Extended attributes:
Address space: 273804165120B (255.00 GB)
Object class: vmnamespace
Object path: /vmfs/volumes/vsan:52f5f66efc19ecc0-f72aa19c783c8172/
Object capabilities: NONE
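To illustrate why an object like this might be reported inaccessible even though every component is ACTIVE, here is a rough Python sketch that tallies votes from the output above. It rests on a simplified model that is my assumption, not VSAN's actual logic: a component with `csn: STALE` contributes no usable votes, and the object needs more than half of all votes plus at least one non-stale data component. The names `parse_components` and `tally` are hypothetical.

```python
import re

# Component lines copied from the vsan.object_info output in this post.
OBJECT_INFO = """\
RAID_1
Component: 31f8e73e-738f-3952-44b2-00266cf2880c (state: ACTIVE (5), csn: STALE (owner stale), host: b4300., md: 5288d2ea-7203-770d-5875-a5a721d925bc, ssd: 52f5b565-c20e-17d9-6b1f-6ebd6c50ae23,
votes: 1, usage: 0.4 GB)
Component: 31f8e73e-f1f7-3b52-08a5-00266cf2880c (state: ACTIVE (5), csn: STALE (owner stale), host: b1200., md: 527993ca-0b3f-d92f-a78b-a94ceccec98d, ssd: 52a32e68-80fc-285d-23f1-1758e80d63a5,
votes: 1, usage: 0.4 GB)
Witness: 31f8e73e-53f0-3d52-c1cc-00266cf2880c (state: ACTIVE (5), host: b9100., md: 52d67be3-f1a1-c6df-c4fa-60100694133c, ssd: 528b8cac-c51d-f0fb-f5fe-7c4d1fd1220d,
votes: 1, usage: 0.0 GB)
"""

# Hypothetical pattern, inferred from the lines above.
COMPONENT_RE = re.compile(
    r"(Component|Witness): (\S+) \(state: (\w+) \(\d+\), "
    r"(?:csn: (\w+) \([^)]*\), )?host: ([^,]+), .*?votes: (\d+)"
)


def parse_components(text):
    """Extract kind, state, csn, host and votes for each component."""
    flat = " ".join(text.split())  # undo the console line wrapping
    return [
        {"kind": kind, "uuid": uuid, "state": state,
         "csn": csn, "host": host, "votes": int(votes)}
        for kind, uuid, state, csn, host, votes in COMPONENT_RE.findall(flat)
    ]


def tally(components):
    """Return (total votes, usable votes, accessible?) under the assumed model."""
    def usable(c):
        # Assumption: stale components do not count toward quorum.
        return c["state"] == "ACTIVE" and c["csn"] != "STALE"

    total = sum(c["votes"] for c in components)
    good = sum(c["votes"] for c in components if usable(c))
    has_data = any(c["kind"] == "Component" and usable(c) for c in components)
    return total, good, (good * 2 > total) and has_data


if __name__ == "__main__":
    print(tally(parse_components(OBJECT_INFO)))
```

Under these assumptions only the witness vote is usable (1 of 3) and no data replica is current, which would match the inaccessible state reported for this object.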
I tried to fix it with the "vsan.check_state -r -e ~cluster" command, but it didn't change anything. I also tried the Virtual SAN tab in the vSphere client, but the "Repair object immediately" button was greyed out.
Does anybody have a solution for this problem?
Shouldn't there be a tool to tell VSAN which of the STALE components to use when all of them are in that state?
Shouldn't an enterprise-class solution recover from a situation like this automatically?