The Case for High Availability

Edit: Shawn still gives awesome presentations.

Originally posted November 15, 2010 on AIXchange

Recently I attended a session on the IBM PowerHA high-availability solutions. The point was made that, given the reliability and uptime of IBM Power servers, many customers wonder why they even need an HA solution.

IBM’s Shawn Bodily, our PowerHA presenter, described one of his typical customer interactions: First, another IBM representative will tell the customer about the hardware and the systems’ reliability, availability and serviceability (RAS) features. Then a second rep will discuss live partition mobility and how it seamlessly shifts logical partitions from one frame to another.

So after 20 to 30 minutes of hearing about how the hardware never fails, THEN Shawn must step in and explain why the customer should be concerned with high availability and disaster recovery. That’s one tough act to follow.

So why should you care about high availability and disaster recovery? I’m reminded of something I heard at another presentation, this one at an IBM Technical University conference: “What’s the most important thing in the data center?”

I can’t recall the name of the presenter who asked that question, but I definitely know the answer. The most important thing in the data center is the applications that run on the systems. These applications are the reason we buy the systems. Really, we don’t worry about systems going down; we worry about losing access to the applications when they do. Or maybe it takes a system failure before we realize just how critical a given application is to the organization. When users can no longer log in, when processing no longer occurs, when the cost of the failure soars by the minute, that’s what we worry about.

A Standish Group study from a few years ago estimated that only about 20 percent of outages are the result of hardware failure. And with today’s Power hardware, one can readily assume that percentage has diminished even further.

So what else can go wrong? What about something like planned maintenance? Live partition mobility might help if your hardware alerts you to the need for a fix: you simply move the workload off the machine, perform the service and move the workload back. But, as Shawn pointed out, what good is moving your workload if what you need to update is the application or the OS itself?

In those scenarios, we might look at multibos updates. Or we might look at using a product like PowerHA to fail over our workload to a standby node. Yes, you’ll see an outage while the application is stopped and then restarted, but only a brief one.

The point is, things happen. Certainly we’ve seen our share of natural disasters in recent years. Or what about a simple power outage that knocks out the electricity and air conditioning? What about operator/user/human error? A mistake is made, files get deleted. These are the reasons you should care about high availability and disaster recovery: at some point you may need to bring your systems up in another location.

Ask yourself the questions that Shawn asked us: How long can you afford to be without your systems? When your systems are recovered, how much data can you afford to lose? I don’t know any companies that really want to be without their systems for any length of time. I can’t imagine any that would view their data as expendable.
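Shawn’s two questions map to the standard recovery objectives: how long you can be down is your recovery time objective (RTO), and how much data you can lose is your recovery point objective (RPO). As a rough back-of-the-envelope illustration (all figures here are hypothetical, not benchmarks from the presentation), you can put numbers on both:

```python
# Hypothetical sketch of the cost behind the RTO and RPO questions.

def outage_cost(rto_minutes: float, cost_per_minute: float) -> float:
    """Revenue/productivity lost while systems are down (the RTO side)."""
    return rto_minutes * cost_per_minute

def data_loss_window(rpo_minutes: float, transactions_per_minute: float) -> float:
    """Transactions at risk if you restore from the last copy (the RPO side)."""
    return rpo_minutes * transactions_per_minute

# Example: a 4-hour recovery at $500/minute, with data replicated every
# 15 minutes while processing 120 transactions/minute.
print(outage_cost(240, 500))       # 120000.0 dollars of downtime
print(data_loss_window(15, 120))   # 1800 transactions potentially lost
```

Even with modest assumed figures like these, the totals add up quickly, which is exactly why no company views its systems or data as expendable.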

When it comes to high availability and disaster recovery, the time to think about it is now, not after you’re hit with something unexpected.