How Much is Too Much Downtime?

Edit: The link has changed. The whitepaper was revised in 2011 and you can still read it, although I do not think much of this applies anymore unless you are running ancient code on older hardware.

Originally posted September 30, 2008 on AIXchange

How often do you hear someone say they’re happy running their applications on Linux on x86 hardware? They don’t want to hear about Power systems–in their minds they are “too expensive.”

I always wonder how much is too much when you’re running your core business applications on these commodity servers. Really, it comes down to how much system downtime you can afford in your environment.

How quickly do you want to be able to call support, diagnose a problem, dispatch a CE and have a repair made? Better yet, what if your machine detects problems and “heals itself,” calling home to IBM so the service reps can let you know that your machine needs service?

If downtime doesn’t translate into lost dollars for your business, then maybe you can afford to take a commodity hardware approach. Some people are just fine deploying server farms consisting of commodity hardware. If they lose one machine, it’s no big deal, because the others that are still running continue to provide service.
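If you want to put a rough number on it, the usual exercise is to translate an availability target (“nines”) into hours of downtime per year and multiply by what an hour of outage costs your business. Here is a minimal sketch in Python; the cost-per-hour figure is just a placeholder, so substitute what an outage really costs you:

```
# Translate availability "nines" into expected downtime per year,
# and attach a hypothetical cost per hour of outage.

HOURS_PER_YEAR = 24 * 365

def downtime_hours_per_year(availability_pct):
    """Expected downtime in hours per year for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

# Placeholder figure -- plug in what an hour of outage actually costs your business.
COST_PER_HOUR = 10_000

for nines in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(nines)
    print(f"{nines:>7}% availability -> {hours:7.2f} h/yr downtime "
          f"-> ~${hours * COST_PER_HOUR:,.0f}/yr at ${COST_PER_HOUR:,}/hr")
```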

The server farm approach has its downsides–overall power consumption, rack space, infrastructure cabling issues, etc. One thing to consider when making these decisions is reliability, availability and serviceability (RAS), a topic covered in this great whitepaper.

From IBM:

“In IBM’s view, servers must be designed to avoid both planned and unplanned outages, and to maintain a focus on application uptime. From a reliability, availability and serviceability (RAS) standpoint, servers in the IBM Power Systems family include features designed to increase availability and to support new levels of virtualization, building upon the leading-edge RAS features delivered in the IBM [System p and System i] servers. This paper gives an in-depth view of how IBM creates highly available servers for business-critical applications.”

Many issues are covered here, including dynamic processor sparing, processor recovery, hot node add (adding a drawer to a running system) and memory protection.

More from the whitepaper:

“The overriding design goal for all IBM Power Systems is simply stated: Employ an architecture-based design strategy to devise and build IBM servers that can avoid unplanned application outages. In the unlikely event that a hardware fault should occur, the system must analyze, isolate and identify the failing component so that repairs can be effected (either dynamically, through “self-healing” or via standard service practices) as quickly as possible – with little or no system interruption. This should be accomplished regardless of the system size or partitioning.”

How much downtime you can afford is something each company must determine for itself. The question revolves around total cost of ownership. What do you need your machine to do to support your business? What kind of performance are you looking for? What kind of reliability do you need? Ultimately, answering those questions will tell you how much downtime you can tolerate.
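To carry that same arithmetic into the buying decision, you can fold expected downtime cost into the total cost of ownership and compare boxes. This is only a back-of-the-envelope sketch; every price, availability figure and outage cost below is a made-up placeholder:

```
# Back-of-the-envelope TCO comparison: hardware price plus the expected
# cost of downtime over the life of the box. Every number here is a
# placeholder -- substitute your own prices, availability and outage cost.

HOURS_PER_YEAR = 24 * 365

def tco(hw_cost, availability_pct, cost_per_outage_hour, years=3):
    """Hardware cost plus expected downtime cost over the given number of years."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_pct / 100) * years
    return hw_cost + downtime_hours * cost_per_outage_hour

commodity = tco(hw_cost=15_000, availability_pct=99.5,  cost_per_outage_hour=5_000)
power     = tco(hw_cost=80_000, availability_pct=99.99, cost_per_outage_hour=5_000)

print(f"Commodity server, 3-yr TCO: ${commodity:,.0f}")
print(f"Power server,     3-yr TCO: ${power:,.0f}")
```

With these invented numbers the “cheaper” box ends up costing far more once downtime is counted; with your numbers the answer may go the other way, which is exactly the point of doing the math.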