A System Outage, and the Failures that Led to It

Edit: Some links no longer work.

Originally posted October 21, 2014 on AIXchange

Old Power servers just run. Most of us know of machines that sat in a corner and did their thing for many years. However, as impressive as Power hardware is, running an old, unsupported production server with an old, unsupported operating system isn’t advisable. This is the story of one such customer and the old, dusty machine sitting in its back room.

This customer had no maintenance contract; they simply hoped the box would continue to hum along. To me, that’s like taking your car to a shop and telling the mechanic: “The check engine light has been on for years, and I’ve never changed the oil or checked the tires. Why are you charging me so much to fix this?”

I imagine some of you are thinking that older applications can be kept in place by running versioned AIX 5.2 or AIX 5.3 WPARs on AIX 7. That option wasn’t taken in this case, however. This was a server running AIX 5.2 on a pair of old internal SCSI disks that were mirrored together in rootvg. Eventually, one of those disks began to fail.
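For reference, spotting a failing mirror before it takes the system down only takes a couple of commands. A minimal sketch, with hdisk0 and hdisk1 standing in for whatever the actual disk names are:

  # Check rootvg for stale copies and missing disks
  lsvg rootvg        # watch the STALE PPs and STALE PVs fields
  lsvg -p rootvg     # both disks should show a PV STATE of active
  lspv hdisk0
  lspv hdisk1
  errpt | more       # disk-related entries such as SC_DISK_ERR* are a red flag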

When did it begin to fail? No one knew, because no one monitored the error logs. When the machine finally had enough, it crashed. Reboots would stop at LED 0518, which generally means rootvg filesystems such as /usr or /var could not be mounted. In isolation, that’s no big deal. Just boot the machine into maintenance mode and run fsck.
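For anyone who hasn’t had to do it, the repair itself looks something like this once you’ve booted from install media or a mksysb and chosen to access the root volume group before mounting the filesystems. The logical volume names below are the AIX defaults; a customized rootvg may differ:

  # Check the unmounted rootvg filesystems
  fsck -y /dev/hd4       # /
  fsck -y /dev/hd2       # /usr
  fsck -y /dev/hd9var    # /var
  fsck -y /dev/hd3       # /tmp
  fsck -y /dev/hd1       # /home
  exit                   # then reboot normally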

In this case, though, going into maintenance mode only raised more unanswerable questions. Where was the install media? No one knew. Where was the most recent mksysb? No one knew. Where was the keyboard for the console? No one knew. Time to start sweating.

Because this was a standalone server, there was no NIM server to recover from. Because it was a production machine, the outage affected several locations. Booting from an older version of AIX and then trying to recover a rootvg built at a newer level is often problematic, and this instance was no exception. Though the customer could have AIX 5.2 media shipped from another location, they’d have to wait a day, and there was no guarantee that the media would be at the same maintenance level as the operating system they were running.
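Part of the risk of mismatched media is not knowing exactly what level a system runs at until you need that information. Recording it ahead of time takes seconds; a minimal sketch (oslevel -s only exists on later releases):

  # Record the OS level so recovery media can be matched to it
  oslevel -r          # maintenance/technology level, e.g. 5200-10
  oslevel -s          # service pack level, on releases that support it
  lslpp -L bos.rte    # base operating system fileset level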

It turns out this customer was very, very fortunate, because someone, somehow, located a four-year-old mksysb tape. The machine booted from the tape drive, and the customer was able to get it into maintenance mode, access the root volume group, and run fsck on the rootvg filesystems. Some errors were corrected and the machine was able to boot. From there, it was a relatively simple matter of unmirroring the failed disk out of rootvg and replacing it with a new one.
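The replacement follows the usual mirrored-rootvg procedure. A rough sketch, again with hdisk1 standing in for the failed disk and hdisk0 for the survivor; the physical hot-swap details depend on the hardware:

  # Break the mirror and remove the failed disk from rootvg
  unmirrorvg rootvg hdisk1
  reducevg rootvg hdisk1
  rmdev -dl hdisk1

  # After physically swapping the drive, discover it and re-mirror
  cfgmgr
  extendvg rootvg hdisk1
  mirrorvg rootvg hdisk1
  bosboot -ad /dev/hdisk1          # rebuild the boot image on the new disk
  bootlist -m normal hdisk0 hdisk1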

While I’m naturally happy that this customer resolved the issue, I present this story as a cautionary tale. Think of everything that was neglected prior to the disk failure. Filesystems and errpt weren’t monitored. While nightly data backups were being taken, there were no recent mksysb backups; it’s possible the last mksysb was taken at the time of system installation. There was no OS install media on hand. Only luck kept this customer from substantial downtime and a significant loss of business.
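None of this is hard to automate. As a minimal sketch, assuming root’s crontab, a tape drive at /dev/rmt0 and that someone actually reads root’s mail, a couple of entries like these would have caught the failing disk and kept a reasonably fresh mksysb on hand:

  # Each morning, mail any permanent hardware errors to root
  0 6 * * * /usr/bin/errpt -d H -T PERM | mail -s "daily errpt check" root

  # Each Sunday, write a bootable mksysb of rootvg to tape
  0 2 * * 0 /usr/bin/mksysb -i /dev/rmt0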

Now consider your environment. Do you occasionally take the time to restore your critical systems on a test basis, just to prove that you could restore them in an actual emergency? If you couldn’t boot a critical system, could you recover it? How long would it take?