The Value and Challenges of a Spare Disk

Edit: Hopefully this will help someone if they run across the same issue.

Originally posted June 16, 2020 on AIXchange

A spare internal disk is handy in general, but it comes with some challenges. Rob McNelly breaks down both.

When I build a new system, I prefer to have the VIO servers run from internal disks. Sure, it’s slower, but it’s a trade-off I’m willing to make. This way I know that if I lose the SAN, I can still boot VIOS to do some troubleshooting.
 
If the customer is willing, I like to have some spare SAS disks available in the frame as well. That takes the urgency out of replacing a failed disk. We immediately migrate our data to the spare disk, and then swap out the bad one at our leisure after IBM ships the replacement.
 
Having a spare internal disk is pretty handy in general. During the early stages of racking and stacking systems, while the SAN team is creating and mapping LUNs, an available disk can be used to load a test LPAR or be a place to stage a NIM server, among many other things.
 
For instance, during a recent POWER9 model S922 install, I created a temporary NIM server on an internal disk that I mapped to a client LPAR. My plan was to migrate it to a SAN LUN once they were available. While I’ve done this many times without a hitch, in this instance, running extendvg rootvg hdisk1 (where hdisk1 is my new SAN LUN) triggered this error:
 
# extendvg rootvg hdisk1
0516-1980 extendvg: Block size of all disks in the volume group must be the same. Cannot mix disks with different block sizes.
0516-792 extendvg: Unable to extend volume group.
 
A search for the message brought up this IBM Support doc:
 
“Such an error means you cannot mix a physical volume (PV) of 4 KB block size with PV blocks of other sizes. The block size of all PVs in the volume group must be the same.
 
“This is explicitly mentioned in the man page for the extendvg command.
 
“Unfortunately, there is no way to fix that issue, because AIX, at the present, only supports 4K block sizes on sissas drives. AIX does not support 4K block sizes on fibre-attached storage. There is no option for block-size translation for SAS disks in firmware nor in AIX kernel.
 
“Naturally, that means that you will not be able to include both sissas drive and fibre-attached drives within the same volume group, since the volume group requires all disks within the volume group to utilize the same disk block size.”
 
I was left with a few choices. I could back up my system to a mksysb and restore it from a NIM server using that mksysb. Of course, that was problematic in this case, since this was the NIM server I was trying to move off the internal disk in the first place. The better option was to bail myself out using the alt_disk_copy command.
  
“The alt_disk_copy command allows users to copy the current rootvg to an alternate disk and to update the operating system to the next maintenance or technology level, without taking the machine down for an extended period of time and mitigating outage risk. This can be done by creating a copy of the current rootvg on an alternate disk and simultaneously applying software updates. If needed, the bootlist command can be run after the new disk has been booted, and the bootlist can be changed to boot back to the older maintenance or technology level of the operating system.”
 
I wasn’t looking to update the software; I only wanted to copy rootvg to my new disk. I ran:
 
# alt_disk_copy -B -V -d hdisk1
 
After running a bosboot, I modified my bootlist so it pointed to the new disk and rebooted from the SAN LUN. Once this was done and I was satisfied everything was running as expected on the new LUN, I cleaned up the original rootvg by running:
 
# alt_rootvg_op -X old_rootvg
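
The bootlist change made before the reboot can be sketched roughly as follows. The article does not give the exact bosboot or bootlist invocations, so this covers only the bootlist step, using the standard `bootlist -m normal` form. It assumes hdisk1 is the new SAN LUN, as above. The `run` helper and the `DRY_RUN` and `TARGET_DISK` variables are hypothetical, added so the commands can be previewed before anything on the system changes.

```shell
#!/bin/sh
# Sketch only: point the normal-mode bootlist at the cloned disk.
# TARGET_DISK, DRY_RUN and run() are illustrative names, not part of AIX.
TARGET_DISK="${TARGET_DISK:-hdisk1}"   # new SAN LUN from the article
DRY_RUN="${DRY_RUN:-1}"                # 1 = print commands, 0 = execute them

run() {
    # Print the command in dry-run mode, otherwise execute it as given.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

switch_boot_disk() {
    # Set the normal-mode bootlist to the target disk.
    run bootlist -m normal "$TARGET_DISK"
    # Display the normal-mode bootlist to confirm the change.
    run bootlist -m normal -o
}

switch_boot_disk
# The reboot is left as a manual step once the bootlist looks right.
```

Running the script with DRY_RUN=0 as root would execute the two bootlist commands for real. Leaving the default in place only prints them.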
 
Sometimes in these situations I don’t immediately consider the power we have with AIX® and IBM Power Systems™ hardware. With alt_disk_copy, I don’t always have to do backup/restore operations, or even migratepv. Of course, it may require a reboot, but in this instance it was well worth the time spent.