The Lines Blur Between Prod and Test

Edit: The links to the webinar resolve but are old and do not seem to work. The first link still lists the speakers at the time of this writing.

Originally posted July 19, 2010 on AIXchange

Recently I was helping a customer implement an IBM PowerHA cluster. We were on the whiteboard going over various failover scenarios. There were going to be two physical servers in the environment, and this question came up: “Are we planning to have one frame be the ‘production’ frame and the other be the ‘test/QA’ frame?”

Not that long ago, implementing a test machine alongside a “prod” machine was a given. Hardware simply wasn’t as reliable back then. So, to protect themselves from hardware failure, companies would install a hot standby backup along with their production machine — just in case. Since that backup box typically sat idle, many companies opted to run test workloads on it. At least this way, that second machine was doing something worthwhile.

However, with the advent of Live Partition Mobility and PowerHA — and with more Reliability, Availability and Serviceability (RAS) built into newer hardware — it’s more or less assumed that machines will stay up. And somewhere between then and now, the distinction between prod and test has started to blur.

Almost three years ago I saw my first Live Partition Mobility demo, and I immediately went from skeptic to true believer.

But even now, I find many customers can’t quite believe what they’re seeing. For instance, a few weeks back I was demonstrating how to move a busy LPAR from one frame to another. The customer had the same skepticism I had back at the beginning: Will it work? Will I drop packets? Is this smoke and mirrors and magic? Yes, it works. No smoke, no mirrors — and no dropped packets.

Because you can quickly and easily move workloads around your environment, you’re freed from the entire concept of “this frame is production” and “that frame is test.” You can concentrate on properly mixing workloads across the environment based on need and available resources. You can create uncapped partitions and give each one an appropriate weight. If the machine has free cycles, you can allocate them on a very granular level. If one machine becomes constrained, you can easily shift your workload to another frame that can better handle the load.
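To make the idea of weights concrete, here is a rough Python sketch of how spare cycles could be divided among uncapped partitions that all want more CPU at the same moment. It assumes a simplified model in which spare capacity is shared in proportion to each partition's uncapped weight; the function name, the LPAR names and the weight values are made up for illustration, and the real hypervisor also takes entitlements and virtual processor counts into account.

```python
def share_spare_capacity(spare_units, weights):
    """Split spare processing units among contending uncapped LPARs.

    Simplified model (an assumption, not the hypervisor's actual algorithm):
    each LPAR receives spare capacity in proportion to its uncapped weight.
    """
    total = sum(weights.values())
    if total == 0:
        # With every weight at zero, nobody receives extra cycles in this model.
        return {name: 0.0 for name in weights}
    return {name: spare_units * w / total for name, w in weights.items()}


# Hypothetical LPARs and weights, chosen only for illustration.
weights = {"prod_db": 192, "prod_app": 128, "test": 64}
shares = share_spare_capacity(1.5, weights)

for name, units in shares.items():
    print(f"{name}: {units:.2f} processing units")
```

With these made-up numbers, the production database gets three times the spare capacity that the test partition does, which is the kind of prioritization that lets prod and test share a frame.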

When my customer and I were discussing PowerHA and whether they wanted the capability of failing over multiple LPARs, someone made a comment, and a light bulb went on in the minds of those present. What if you set things up the “old way,” your production frame dies for some reason, and you need to fail over your prod workload? Should the whole environment fail over at once, or would it be preferable to have half of prod fail over while the other half keeps on processing? After all, in a mixed environment with production LPARs running on different physical machines, losing a frame means failing over only a subset of the environment as opposed to the whole thing.
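The difference is easy to see with a toy layout. The Python sketch below compares the two approaches for a hypothetical environment of four production and four test LPARs on two frames. All of the names are invented for illustration, and nothing here reflects how PowerHA itself decides what to move; it only counts which production LPARs sit on the frame that died.

```python
def lpars_to_fail_over(placement, failed_frame, role="prod"):
    """Return the LPARs of the given role that were running on the failed frame."""
    return sorted(
        name
        for name, (frame, lpar_role) in placement.items()
        if frame == failed_frame and lpar_role == role
    )


# "Old way": one frame is all production, the other is all test.
old_way = {
    "prod1": ("frame_a", "prod"), "prod2": ("frame_a", "prod"),
    "prod3": ("frame_a", "prod"), "prod4": ("frame_a", "prod"),
    "test1": ("frame_b", "test"), "test2": ("frame_b", "test"),
    "test3": ("frame_b", "test"), "test4": ("frame_b", "test"),
}

# Mixed: production and test LPARs are spread across both frames.
mixed = {
    "prod1": ("frame_a", "prod"), "prod2": ("frame_a", "prod"),
    "test1": ("frame_a", "test"), "test2": ("frame_a", "test"),
    "prod3": ("frame_b", "prod"), "prod4": ("frame_b", "prod"),
    "test3": ("frame_b", "test"), "test4": ("frame_b", "test"),
}

old_hit = lpars_to_fail_over(old_way, "frame_a")
mixed_hit = lpars_to_fail_over(mixed, "frame_a")

print("Old way, frame_a lost:", old_hit)
print("Mixed, frame_a lost:  ", mixed_hit)
```

In the old layout, losing frame_a takes all four production LPARs with it. In the mixed layout, only two need to fail over while the other two keep processing on frame_b.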

CPU micro-partitioning, PowerVM server virtualization, Live Partition Mobility and PowerHA are all game changers. When we plan for these technologies, we must also rethink the way our systems are implemented. Though it’s tempting to still think in terms of standalone systems, alternatives are now possible. Rather than separating prod from test, we may find that mixing production with test on the same frame makes perfect sense.

Note: IBM is hosting a pair of webcasts on future trends relating to Power Systems. Register here and here.