…. and why we had to resignature them.
Last year the HP EVA that hosted two of our ESX 3.5 clusters was getting too small. It was also getting old, and we wanted to be able to take it out of the maintenance contract. So we decided to migrate all the data to another HP EVA with enough space. This blog post describes how we did this, in two parts. The first part (this post) describes how we migrated the storage and the problems we encountered while doing so. The second part will be written by Robert van den Nieuwendijk, who wrote all the PowerCLI scripts needed for this migration, and will appear shortly on his blog.
The storage device that was getting too small was an HP EVA 4000 with 22 TB capacity. We had already expanded an HP EVA 8000 and made sure it had enough free space to host all the data of the EVA 4000.
The first problem we encountered was that the EVA 8000 already had many replication copysets (CA). The number of copysets is limited to a maximum of 256, so we could not simply replicate every vDisk of the EVA 4000 in one go.
We had two choices: either migrate the storage of the EVA 4000 in two steps, which would mean two major changes with downtime, or create some new VMFS volumes on the destination system, sVmotion every machine that could be sVmotioned to the destination datastores without downtime, and plan a change for the remainder. The latter is what we did. We also cleaned up VMs and other storage, leaving the source EVA with just a little over the number of allowed vDisks to be replicated.
Among those disks were the ESX 3.5 boot disks. Those could be replicated and failed over before the actual storage migration of the VMs.
Here we encountered our second problem. After we failed over the ESX hosts, an error message was shown on the Service Console of each host: “XXX may be snapshot, disabling access, see resignaturing info…”.
The ESX hosts all started just fine, and at first I thought nothing of it, until I remembered a similar problem in our virtualization test environment just over a year earlier: all my datastores simply disappeared when I placed an ESX 4 host in my test cluster. When I checked the datastores of the ESX hosts that had been failed over, it turned out they too were missing their corresponding boot VMFS volumes. This didn’t seem to be a problem for the ESX hosts themselves, but with the migration of the remaining datastores, the ones holding virtual machines, just over a week away, I felt this wasn’t something I could ignore.
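To see which devices a host has flagged this way, the VMkernel log can be searched for the message. The following is a minimal sketch, not something we used during the migration: it assumes the message is written to /var/log/vmkernel and that the device appears in vmhbaA:T:L:P form directly before the words “may be snapshot”. The function name list_snapshot_luns is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads log text on stdin and prints every device
# (vmhbaA:T:L:P) that appears in a "may be snapshot" message, once each.
list_snapshot_luns() {
    sed -n 's/.*\(vmhba[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\) may be snapshot.*/\1/p' | sort -u
}

# Usage on the service console (the log path is an assumption):
#   list_snapshot_luns < /var/log/vmkernel
```

An empty result on a host that shows missing datastores would suggest the message is logged elsewhere or in a different format on that build.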
The boot vmfs volumes were easily restored by creating a new vmfs:
- To find the device with the boot LUN
# esxcfg-mpath -l
- Or (when the device is sda)
# esxcfg-mpath -l | grep sda
- To find the partition, look for the partition of unknown type with Id fb (here vmhba1:2:1:5).
# fdisk -l /dev/sda
- Then create a new vmfs
# vmkfstools -C vmfs3 -b 1m -S name_boot vmhba1:2:1:5
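The lookup of the fb partition can be sketched as a small filter over the fdisk output. This is only an illustration under a few assumptions: the fdisk listing has the usual columns (Device, Boot, Start, End, Blocks, Id, System), and the helper names find_vmfs_partitions and to_vmhba_partition are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `fdisk -l` output on stdin and prints the
# device name of every partition whose Id column is fb (VMFS).
find_vmfs_partitions() {
    awk '/^\/dev\// {
        sub(/\*/, "")              # drop the boot flag so the columns line up
        if ($5 == "fb") print $1   # after the device name, column 5 is the Id
    }'
}

# Hypothetical helper: combine a Linux partition name with a vmhba path.
#   $1 = partition device, e.g. /dev/sda5
#   $2 = adapter:target:LUN path, e.g. vmhba1:2:1
to_vmhba_partition() {
    part=$(echo "$1" | sed 's/^.*[^0-9]\([0-9][0-9]*\)$/\1/')
    echo "$2:$part"
}

# Usage on the service console:
#   fdisk -l /dev/sda | find_vmfs_partitions
#   to_vmhba_partition /dev/sda5 vmhba1:2:1
```

The second helper only glues the partition number onto the path reported by esxcfg-mpath. It does not check that the two actually belong to the same LUN, so that still has to be verified by hand before running vmkfstools.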
We postponed the migration by two weeks, and I set up a test environment to try to reproduce the problem. After reading up on it, I found that this “may be snapshot” issue results from the way ESX creates the UUID of a VMFS volume. ESX can wrongly mark a VMFS volume as a snapshot when it detects a change in the volume’s UUID. That UUID is composed of several components. The LUN number of the disk is one and the UUID of the disk on the storage device is another, but those had not changed: I had presented everything with the same LUN numbers, and because this was a failover, the UUID on the EVA was the same as well. We had even done failovers before without ever seeing this problem. There was just one difference: this failover went from an HP EVA 4000 to an HP EVA 8000 and, as it turns out, ESX also uses the driver signature of the storage device when creating the UUID of the VMFS volume, which of course differs between an HP EVA 4000 and an HP EVA 8000. Hence the UUIDs of the datastores were different after the failover, and the datastores were wrongly detected as snapshots. I was able to reproduce this behavior in our test environment.
So the problem was identified, next the solution.
After researching the problem, it seemed we had two options to tackle it.
- Set DisallowSnapshotLun to 0 (zero) (LVM.DisallowSnapshotLun), see VMware KB 6482648
- Set Snapshot Resignature temporarily to 1 (LVM.EnableResignature), see VMware KB 9453805
Both have their own pros and cons.
First: Set Disallowsnapshotlun to 0 (zero)
- Advantage: It’s fast and simple.
- Disadvantage: You can no longer present snapshots to an ESX cluster with this setting, because doing so will cause data corruption. Since this is not the default setting, you will have to be very careful to document it.
Second: Set Snapshot Resignature temporarily to 1
- Advantage: All disks get a completely new UUID, leaving you with a nice and fresh environment.
- Disadvantage: It’s a lot of work. You have to export your machine data (annotations, status etc.), power down all machines (we’d have to do that anyhow), remove all machines from the inventory, do the failover, do the resignature, relabel all VMFS volumes and import all machines back into the inventory. Then you move everything into the correct folders and resource pools, set the annotations again, remap all RDMs and clean up the “old” VMFS volumes.
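For the resignature step of the second option, the service console side comes down to switching the setting on, rescanning, and switching it off again. The sketch below is only an illustration of that sequence, not the procedure from our plan: it assumes the advanced setting is reachable as /LVM/EnableResignature through esxcfg-advcfg and that a rescan of the adapter (vmhba1 here, as an example) triggers the resignature. The run wrapper and the DRY_RUN variable are made up for this example, so the commands are only printed unless dry-run is switched off.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: print or execute a command, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 = storage adapter to rescan, e.g. vmhba1 (an example value)
resignature_volumes() {
    run esxcfg-advcfg -s 1 /LVM/EnableResignature   # temporarily allow resignaturing
    run esxcfg-rescan "$1"                          # rescan so the volumes are picked up again
    run esxcfg-advcfg -s 0 /LVM/EnableResignature   # set it back to the default
}

# Usage:
#   resignature_volumes vmhba1             # dry run, prints the three commands
#   DRY_RUN=0 resignature_volumes vmhba1   # actually execute them
```

Printing the commands first makes it easy to paste the intended sequence into a change plan. The first option would use the same esxcfg-advcfg call with /LVM/DisallowSnapshotLun set to 0 instead, again assuming that path.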
Even though the second option was quite a task, we decided to go for it and resignature our disks. Obviously it was all way too much work to do by hand, so we decided to script it all using PowerCLI. Robert van den Nieuwendijk, whose blog you can read here, wrote all the PowerCLI scripts, and I wrote a plan describing every step, to make sure we wouldn’t skip anything on the day. I also made all the storage preparations, including the SSSU scripts for the day itself.
On the day itself everything went smoothly and as planned, thanks in large part to Robert’s scripts, and we managed to fail over all the storage, resignature it and get every machine running again in just under 4 hours.