Analyzing Live Partition Mobility

Edit: This is taken for granted now. Some links no longer work.

Originally posted November 2007 by IBM Systems Magazine

I was in the Executive Briefing Center in Austin, Texas, recently for a technical briefing. It’s a beautiful facility, and if you can justify the time away from the office, I highly recommend scheduling some time with them to learn more about the latest offerings from IBM. From their Web site:

“The IBM Executive Briefing Center in Austin, Texas, is a showcase for IBM System p server hardware and software offerings. Our main mission is to assist IBM customers and their marketing teams in learning about new IBM System p and IBM System Storage products and services. We provide tailored customer briefings and specialized marketing events.

“Customers from all over the world come to the Austin IBM Executive Briefing Center for the latest information on the IBM UNIX-based offerings. Here they can learn about the latest developments on the IBM System p and AIX 5L, the role of Linux and how to take advantage of the strengths of our various UNIX-capable IBM systems as they deploy mission-critical applications. Companies interested in On Demand Business capabilities also find IBM System p offers some of the most advanced self-management features for UNIX servers on the market today.”

While I was in Austin, one of the things that IBM demonstrated was how you can move workloads from one machine to another. IBM calls this Live Partition Mobility.

I saw it in action and went from skeptic to believer in a matter of minutes. At the beginning, I kept saying things like, “This whole operation will take forever.” “The end users are going to see a disruption.” “There has to be some pain involved with this solution.” Then they ran the demo.

The presenters had two POWER6 System p 570 machines connected to the same hardware-management console (HMC). They started a tool that simulated a workload on one of the machines, then kicked off the partition-mobility process. It was fast, and it was seamless. The workload moved from the source frame to the target frame. Then they showed how they could move it from the target frame back to the original source frame. They said they could move that partition back and forth all day long. (Ask your business partner or IBM sales representative to see a copy of the demo. There’s a Flash-based recording of it that was made for customers. I’m still waiting for it to show up on YouTube.)
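
If you’d rather drive the move from the command line than from the HMC GUI, the HMC’s migrlpar command handles both validation and migration. Here’s a minimal sketch, assuming hypothetical managed-system names p570-source and p570-target and a partition named lpar1:

    # Validate first: this checks the VIOS, network and storage
    # prerequisites without actually moving anything.
    migrlpar -o v -m p570-source -t p570-target -p lpar1

    # If validation passes, perform the live migration.
    migrlpar -o m -m p570-source -t p570-target -p lpar1

Running the validate operation first is worth the extra step, since it surfaces most configuration problems before the partition’s memory ever starts moving.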

The only pain that I can see with this solution is that the entire partition you want to move must be virtualized. You must use a virtual I/O (VIO) server and boot your partition from shared disk presented by that VIO server, typically a storage-area network (SAN) logical unit number (LUN). You must use a shared Ethernet adapter, and all of your storage must be virtualized and shared between the VIO servers on both machines. Both machines must be on the same subnet and be managed by the same HMC, and you must be running on the new POWER6 hardware with a supported OS.
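
To give a feel for what “fully virtualized” means in practice, here’s a rough sketch of the kind of commands involved. The adapter and disk names are hypothetical, and the exact syntax can vary by VIOS and HMC level:

    # On each VIO server (as padmin): bridge the physical network
    # into the client partitions with a shared Ethernet adapter.
    mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1

    # Map the SAN LUN (zoned to both machines' VIO servers) to the
    # client partition's virtual SCSI adapter.
    mkvdev -vdev hdisk2 -vadapter vhost0 -dev lpar1_rootvg

    # On the HMC: check that both frames report mobility capability.
    lssyscfg -r sys -F name,active_lpar_mobility_capable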

Once you get everything set up and hit the button to move the partition, it all goes pretty quickly. Since the operation moves a ton of data over the network (it has to copy a running partition’s memory from one frame to another), IBM suggests running it over Gigabit Ethernet rather than 100 Megabit Ethernet.
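
A back-of-the-envelope calculation shows why (the 16 GB figure is just a made-up example): moving a partition with 16 GB of memory over Gigabit Ethernet at a realistic ~100 MB/s takes roughly 16,000 / 100 = 160 seconds, while the same transfer over 100 Megabit Ethernet at ~11 MB/s stretches past 20 minutes, and memory pages that change during the copy have to be re-sent, so a slow link hurts twice. While a move is in flight, the HMC’s lslparmigr command can report its progress; a sketch, using the same hypothetical names as above:

    # Show the migration state of partitions on the source frame.
    lslparmigr -r lpar -m p570-source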

I can think of a few scenarios where this capability would be useful:

The next time errpt shows me a sysplanar error, I call support and they confirm that we have to replace a part (which usually requires a system power down). I just schedule the CE to come do the work during the day. Assuming I have my virtualization in place and a suitable machine to move my workload to, I move my partition over to the other hardware while the repair is carried out (a sketch of this workflow appears after these scenarios). No calling around the business asking for maintenance windows. No doing repairs at 1 a.m. on a Sunday. We can do the work whenever we want, because the business will see no disruption at all.

Maybe I can run my workload just fine most of the time on a smaller machine, but at certain times (e.g., month end) I’d rather run the application on a faster processor or a beefier machine that’s sitting in the computer room. Move the partition over to finish running a large month-end job, then move it back when the processing completes.

Maybe it’s time to upgrade your hardware. Bring in your new machine, set up your VIO server, move the partition to your new hardware and decommission your old hardware. Your business won’t even know what happened, but will wonder why the response time is so much better.
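
As promised in the first scenario, here’s what the hardware-maintenance workflow might look like from the command line. This is a sketch under the same hypothetical names used earlier, not a production script:

    # Spot the hardware complaint in the AIX error log.
    errpt | grep -i sysplanar

    # Validate, then evacuate the partition to the healthy frame
    # before the CE powers the source frame down.
    migrlpar -o v -m p570-source -t p570-target -p lpar1
    migrlpar -o m -m p570-source -t p570-target -p lpar1

    # After the repair, move it back home.
    migrlpar -o m -m p570-target -t p570-source -p lpar1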

What happens if you’re trying to move a partition and your target machine blows up? If the workload hasn’t completely moved, the operation aborts and you continue running on your source machine.
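
The HMC exposes this manually as well: an in-flight migration can be stopped, and one that was interrupted partway can be cleaned up with a recover operation. A sketch, again with hypothetical names:

    # Stop an in-progress migration; the partition keeps running
    # on the source frame.
    migrlpar -o s -m p570-source -p lpar1

    # Recover after a failed or interrupted migration.
    migrlpar -o r -m p570-source -p lpar1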

This technology isn’t a substitute for High Availability Cluster Multi-Processing (HACMP) or any kind of disaster-recovery situation. This entire operation assumes both machines are up and running, and resources are available on your target machine to handle your partition’s needs. Planning will be required.

This is a tool I’ll be very happy to recommend to customers.