The Importance of DR Testing

Edit: Taking a backup without actually testing that you can recover from it is about as good as making a wish. Link no longer works.

Originally posted April 3, 2012 on AIXchange

Recently my customer wanted to see if its old, unsupported application could be recovered in an emergency. The application was running on AIX 5.2, and I was cloning that system to another piece of older hardware for a disaster recovery test. While the customer had been taking mksysb and application backups for some time, this was the first actual attempt to recover the system.

After restoring the mksysb to the target machine, we went to run the vendor’s built-in scripts to recover the database. It turns out that all of the data and binaries we needed to run the recovery operation were on datavg instead of rootvg.

It also turned out there was no datavg backup, only the database backups. This became the first item to address, ensuring we had a savevg of datavg. In this case the customer had 135 logical volumes that had to be recreated, some jfs2 and some raw.

The customer really wanted this cloned machine to be identical to the source machine, down to the PP size, but no way was I going to recreate 135 logical volumes manually. So I went ahead and did a savevg from the source machine. Had this been an actual DR situation, we would have already been in trouble had the LV information not been stored somewhere. (Hint: Besides backups, it may be handy to have output from important files like /etc/filesystems available to you in case of an emergency.)
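One way to act on that hint is a small script that copies key files and records the LVM layout into a directory on rootvg, so the information travels with the mksysb. What follows is only a sketch under assumptions: the snapshot_config function name and the output file names are invented for illustration, and the LVM section runs only where the standard AIX lsvg and lspv commands are installed.

```shell
#!/bin/sh
# Copy the named files into an output directory and, where the AIX LVM
# commands are available, record the volume group and disk layout there.
# Usage: snapshot_config OUTDIR [FILE ...]
snapshot_config() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1

    # Keep a copy of each file that exists (for example /etc/filesystems).
    for f in "$@"
    do
        [ -f "$f" ] && cp -p "$f" "$outdir/$(basename "$f")"
    done

    # Record the LVM layout only if lsvg is present (i.e. on AIX).
    if command -v lsvg >/dev/null 2>&1
    then
        lspv > "$outdir/lspv.out"
        for vg in $(lsvg -o)            # active volume groups
        do
            lsvg "$vg"    > "$outdir/lsvg_$vg.out"     # includes PP SIZE
            lsvg -l "$vg" > "$outdir/lsvg_l_$vg.out"   # LV names, types, sizes
            lsvg -p "$vg" > "$outdir/lsvg_p_$vg.out"   # disks in the VG
        done
    fi
    return 0
}
```

Calling it as snapshot_config /var/adm/lvmdocs /etc/filesystems (the directory name is hypothetical) before each mksysb would keep a current copy of the layout inside the rootvg backup.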

When I tried to restore the information from the savevg and remake the volume group

(smitty/system storage management/logical volume manager/volume groups/remake a volume group),

it kept coming up with a 512MB PP size instead of the 64MB PP size I was inputting. Even when I tried it from the command line (restvg -f /dev/rmt0 -r -n -P 64 hdisk4), it’d still create the 512MB PP size.

However, since I still had the source system, it was a simple matter of taking the logical volume information from the source volume group and copying it to /tmp:

lsvg -l datavg > /tmp/datavgout.file

Because I only cared about the first three columns of the lsvg output, I ran this command to obtain the LV name, LV type and number of LPs:

awk '{print $1, $2, $3}' /tmp/datavgout.file > /tmp/datavgout2.file
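The listing that lsvg -l prints starts with two header lines (the volume group name, then the column titles). Each row after that has the LV name in the first column, the type in the second and the LP count in the third. A minimal sketch of a filter that drops the headers and keeps those three fields follows. The lv_table function name and the sample rows are made up for illustration and are not output from the customer's system.

```shell
#!/bin/sh
# Reduce `lsvg -l <vg>` output (read from stdin) to "name type LPs" rows.
# Assumes the usual layout: line 1 is "<vg>:", line 2 is the column header,
# and every later row starts with LV NAME, TYPE, LPs.
lv_table() {
    awk 'NR > 2 && NF >= 3 { print $1, $2, $3 }'
}

# Illustrative sample only.
sample_lsvg_output() {
    cat <<'EOF'
datavg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
loglv00             jfs2log    1     1     1    open/syncd    N/A
fslv00              jfs2       160   160   1    open/syncd    /data
rawlv01             raw        32    32    1    open/syncd    N/A
EOF
}

sample_lsvg_output | lv_table
```

On a real system the input would come from lsvg -l datavg rather than the sample function.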

I created the datavg on the target system manually with the 64 MB PP size. Then I edited the datavgout2.file and made sure it had the correct LV name, LV type and the number of PPs that I wanted on the target machine. To read the file and create the LVs, I ran this simple loop:

while read i j k
do
    mklv -t "$j" -y "$i" datavg1 "$k"
done < /tmp/datavgout2.file

($i is the name, $j is the type and $k is the number of partitions. mklv counts logical partitions, which match the PP count when the LV is not mirrored.)
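Before creating 135 LVs in one pass, it can help to print the mklv commands first and read them over. The sketch below does that as a dry run: it reads name, type and count rows and only echoes the commands it would run, skipping any row whose count is not a positive integer. The gen_mklv_cmds function name and the sample rows are hypothetical; datavg1 is the volume group name used in the loop above.

```shell
#!/bin/sh
# Print (do not run) one mklv command per "name type count" row on stdin.
# $1 is the target volume group name.
gen_mklv_cmds() {
    vg=$1
    while read -r name type count
    do
        # Skip blank lines and comment lines left in the edited file.
        case $name in
            ''|'#'*) continue ;;
        esac
        # Skip rows whose count is missing, zero or not a whole number.
        case $count in
            ''|*[!0-9]*|0)
                echo "skipping bad row: $name $type $count" >&2
                continue ;;
        esac
        echo "mklv -t $type -y $name $vg $count"
    done
}

# Hypothetical rows; the last one is deliberately malformed.
printf '%s\n' 'loglv00 jfs2log 1' 'fslv00 jfs2 160' 'rawlv01 raw x' |
    gen_mklv_cmds datavg1
```

Once the printed commands look right, the same output could be piped to sh on the target system to create the LVs.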

I did end up using smitty (smitty/system storage management/logical volume manager/volume groups/restore) to restore the files in the jfs2 filesystems.

Once the volume group had been recreated and the necessary files were restored, I could use the database backup tape to restore the database.

The customer now takes a daily savevg of datavg, and all of the necessary LV information from the entire system is saved as part of the rootvg backup. In the end we were able to get a running system. Even more important, we learned something. Without going through this exercise, my customer might have been missing key information and data it needed to restore the system in an actual disaster.

This is why simply having a DR plan isn’t enough. Your plan must be tested. Even if you think you have good backups, it might not be the case.

Incidentally, Anthony English recently wrote about recovering datavg filesystems as well. He discusses using the mkvgdata (to capture the volume group structure) and restvg commands. His information is worth considering if you find yourself recreating a volume group. It’s certainly simpler than going through the gyrations that I went through, though I’ll still need to test it to see if Anthony’s approach will eliminate my customer’s PP size issue.