Real World Disaster Recovery

Edit: One of my favorite articles.

Originally published in June 2006 by IBM Systems Magazine

Disaster recovery (D/R) planning and testing has been a large part of my career. I’ve never forgotten my first computer-operations position and the manager who showed me a cartoon of two guys living on the street. One turned and said to the other, “I did a good job, but I forgot to take good backups.”

I’ve been involved in D/R exercises for a variety of customers, and I was also peripherally involved in a D/R event that happened after Hurricane Katrina.

Does your datacenter have the right procedures and equipment in place to recover your business from a disaster? Can your business survive extended downtime without your computing resources? Is your company prepared for a planned D/R event? What about an unplanned event? I’ve helped customers recover from both types of events. This article provides a place to start when considering D/R preparations for your organization.

Comfortable Circumstances

There’s a big difference between planned and unplanned D/R events. After traveling to an IBM* Business Continuity and Recovery Services (BCRS) center, I helped restore 20 AIX* machines within the 72 allocated hours. I was well-rested and well-fed. We knew the objectives ahead of time, and we took turns working and resting. Additionally, we didn’t restore every server in the environment; we hand-picked a cross-section of them. We modified, reviewed and tested our recovery documentation before we made the trip, and we made sure there was enough boot media to do all the restores simultaneously. We even cut an extra set of backup tapes just in case.

We had a few minor glitches along the way, but we were satisfied that we could recover our environment. However, these results must be taken with a grain of salt, as this whole event was executed under ideal circumstances.

In another exercise, I didn’t have to travel anywhere; I went to the BCRS suite at my normal IBM site and spent the day doing a mock D/R exercise. We were done within 12 hours. We had a few minor problems, but the team agreed that we could recover the environment in the event of an actual disaster. Again, I was well-rested and well-fed.

Katrina Circumstances

As Hurricane Katrina was about to make landfall, e-mails went out asking for volunteers to help with customer-recovery efforts. I submitted my name, but there were plenty of volunteers, so I wasn’t needed. A few weeks later, the AIX admin who had been working on the recovery got sick, and I was asked to travel onsite to help.

Although I can’t compare the little bit that I did with the Herculean efforts that were made before I arrived, I was able to observe some things that might be useful during your planning.

A real D/R was much different from the tests that I’d been involved with in the past. The people worked around the clock in cramped quarters, getting very little sleep. There were too many people on the raised floor, and there weren’t enough LAN drops for the technicians to be on the network simultaneously.

The customer’s equipment was due for a refresh, so new hardware was being deployed at the same time as the data was being recovered, which posed additional problems. Fortunately, the customer had a hot backup site where the company could continue operations while the new environment was being built. However, as is often the case, the hot backup site had older, less powerful hardware. It was operational, but barely, and we wanted to get another primary site running quickly.

One of the obvious methods of disaster preparation is to have a backup site that you can use if your primary location goes down. Years ago, I worked for a company that had three sites taking inbound phone calls. They had identical copies of the database running simultaneously on three different machines, and they could switch over to the other sites as needed. During the time I was there, we had issues (snow, rain, power, hardware, etc.) that necessitated a switchover to a remote location. We needed to bring down two sites and temporarily run the whole operation on a single computer. This was quite a luxury, but the needs of the business demanded it. This might be something to consider as you assess your needs.
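The mechanics of a switchover vary enormously, but the decision itself can be as simple as walking a preference-ordered list of sites and picking the first healthy one. Here is a minimal Python sketch of that idea; the site names, addresses, port and timeout are hypothetical placeholders, not anything from that environment.

```python
#!/usr/bin/env python3
"""Pick the first healthy site from a preference-ordered list.

The site names, addresses, port and timeout are hypothetical placeholders.
"""
import socket

# Sites listed in the order you would prefer to run from.
SITES = [
    ("site-a", "10.0.1.10", 5432),
    ("site-b", "10.0.2.10", 5432),
    ("site-c", "10.0.3.10", 5432),
]

def site_is_healthy(host, port, timeout=3.0):
    """Treat a successful TCP connection to the database port as 'healthy'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_active_site(sites):
    """Return the name of the first healthy site, or None if all are down."""
    for name, host, port in sites:
        if site_is_healthy(host, port):
            return name
    return None

if __name__ == "__main__":
    active = choose_active_site(SITES)
    if active:
        print(f"Route inbound calls and traffic to: {active}")
    else:
        print("No healthy site found -- escalate to the D/R coordinator")
```

A real switchover would also weigh replication lag and application health, not just a listening port, but writing down the preference order and the decision rule ahead of time is what saves arguments in the middle of a disaster.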

Leadership must be established before beginning, whether it’s a test or a real disaster. Who’s in charge: the IBM D/R coordinator, the customer or the technicians? And which technicians are driving the project: the administrators from the customer site, consultants or other technicians? All of these issues should be clearly defined so people can work on the task at hand and avoid any potential political issues.

The Importance of Backups

During my time with the Katrina customer recovery, I found out that one of the customer’s administrators had to be let go. On the surface, he’d been doing a great job with his backup jobs: he ran incremental backups every night, and they ran quickly. However, nobody knew how many years it had been since his last full backup, so the backup tapes were useless. Fortunately, their datacenter wasn’t flooded and, after the water receded, they were able to recover some of their hardware and data.
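A simple age check on the backup catalog would have caught that gap long before a disaster did. The sketch below assumes a hypothetical text export of the catalog, one job per line (date, type, host); the file name, format and 35-day threshold are illustrative assumptions, not how that customer’s backups were actually recorded.

```python
#!/usr/bin/env python3
"""Flag stale full backups from a simple catalog export.

Assumed (hypothetical) catalog format, one backup job per line:
    2006-05-28 full server01
    2006-06-01 incremental server01
"""
from datetime import datetime, timedelta
import sys

CATALOG = "backup_catalog.txt"     # hypothetical export of the backup catalog
MAX_FULL_AGE = timedelta(days=35)  # alert if the newest full backup is older

def last_full_backup(path):
    """Return the date of the most recent 'full' entry, or None if absent."""
    newest = None
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 2 and parts[1] == "full":
                when = datetime.strptime(parts[0], "%Y-%m-%d")
                if newest is None or when > newest:
                    newest = when
    return newest

if __name__ == "__main__":
    newest_full = last_full_backup(CATALOG)
    if newest_full is None:
        sys.exit("No full backup in the catalog at all -- the incrementals have nothing to apply to.")
    age = datetime.now() - newest_full
    if age > MAX_FULL_AGE:
        sys.exit(f"Newest full backup is {age.days} days old -- the incrementals may be unrestorable.")
    print(f"Newest full backup: {newest_full:%Y-%m-%d} ({age.days} days ago)")
```

Run something like this on a schedule and send yourself the output; the point is simply that someone, or something, watches the age of the newest full backup instead of taking the nightly incrementals as proof that all is well.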

Are your backups running? Are you backing up the right data? Have you tested a restore? One of the lessons we learned during a recovery exercise was that our mksysb restore took much longer than our backup. Another lesson we learned was that sysback tapes may or may not boot on different hardware. Does your D/R site/backup site have identical hardware? Does your D/R contract guarantee what hardware will be available to you? Do you even have a D/R contract?
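One way to answer the timing questions is to measure each restore step during a test and add the numbers up against the recovery window the business expects. The step names, durations and 72-hour target in this sketch are placeholders meant only to show the arithmetic; substitute whatever your own restore tests actually measure.

```python
#!/usr/bin/env python3
"""Compare measured restore times against a recovery-time objective (RTO).

The step names, durations and 72-hour RTO are placeholders; fill them in
from your own timed restore tests.
"""

# Hours measured during a (hypothetical) restore test.
MEASURED_HOURS = {
    "mksysb restores": 10.0,
    "application data restores": 34.0,
    "verification and cutover": 8.0,
}

RTO_HOURS = 72.0  # the outage window the business says it can tolerate

total = sum(MEASURED_HOURS.values())
for step, hours in MEASURED_HOURS.items():
    print(f"{step:30s} {hours:6.1f} h")
print(f"{'total':30s} {total:6.1f} h (RTO = {RTO_HOURS:.0f} h)")

if total > RTO_HOURS:
    print("The plan does not fit the RTO -- rework the backup and restore strategy.")
else:
    print(f"Headroom before the RTO is exceeded: {RTO_HOURS - total:.1f} h")
```

If the total doesn’t fit the window, you want to learn that during a test, not while the business is waiting.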

Personnel Issues

We had personnel working on this project who were from the original customer location and knew how to rebuild the machines. However, they were somewhat distracted as they worried about housing and feeding their families and finding out what had happened to their property back home. Some were driving hundreds of miles to go home on the weekend – cleaning up what they could – and then making the long drive back to the recovery site. Can you give your employees the needed time away from the recovery so they can attend to their personal needs? What if your employees simply aren’t available to answer questions? Will you be able to recover?

Other Issues

Other issues that came up involved lodging, food and transportation. FEMA was booking hotel rooms for firefighters and other rescue workers, so finding places to stay was a challenge. For a time, people were working around the clock in rotating shifts. Coordinating hotel rooms and meals was a full-time job. Instead of wasting time looking for food, the support staff brought meals in and everyone came to the conference room to eat.

You may remember that Hurricane Rita was the next to arrive, so there were fresh worries about what this storm might do, and gasoline shortages started to occur. After you’ve survived the initial disaster, will you be able to continue with operations? I remember reading a blog around this time about some guys in a datacenter in New Orleans and all the things they did to keep their generators and machines operational. Do you have employees who are willing to make personal sacrifices to keep your business going? Will you have the supplies available to keep the people supporting the computers fed and rested?

Test, Test, Test

I highly recommend testing your D/R documentation. If it doesn’t exist, I’d start working on it. Are you prepared to continue functioning when the next disaster strikes? Will a backhoe knock out communications to your site and leave you without the ability to continue serving your customers? Do you have a BCRS contract in place? I know I don’t want to end up like the guy in the cartoon, complaining that he didn’t have good backups and D/R procedures in place. Do you?