Skeptic to Believer: Live Partition Mobility Has Many Potential Uses

Edit: I feel like I was starting to hit my stride this week and the week before. The topics and the content got a little meatier as time went on. It is hard to believe how exciting it was the first time I saw Live Partition Mobility in action. It is interesting to see that the scenarios I described are very common these days, with much of it being automated and much faster than it was in the POWER6 days. I was able to dig up a link (included below) to the press release: IBM Demonstrates a UNIX Virtualization Exclusive, Moves Workloads From One Machine to Another — While They’re Running

Originally posted September 10, 2007 on AIXchange

When I was in Austin, Texas, recently for a technical briefing, IBM demonstrated how you can move workloads from one machine to another. They call it Live Partition Mobility.

I saw it in action and I went from skeptic to believer in a matter of minutes. I kept saying things like: “This whole operation will take forever.” “The end users are going to see a disruption.” “There has to be some pain involved with this solution.” Then they ran the demo.

They had two POWER6 System p 570 machines connected to the hardware-management console (HMC). They started a tool that simulated a workload on one of the machines. They kicked off the partition-mobility process. It was fast, and it was seamless. The workload moved from the source frame to the target frame. Then they showed how they could move it from the target frame back to the original source frame. They said they could move that partition back and forth all day long. (Ask your business partner or IBM sales rep for a copy of the demo. A Flash-based version was recorded to show customers. I’m still waiting for it to show up on YouTube.)

The only pain that I can see with this solution is that the entire partition that you want to move must be virtualized. You must use a virtual I/O (VIO) server and boot your partition from shared disk that’s presented by that VIO server, typically a storage-area network (SAN) logical unit number (LUN). You must use a shared Ethernet adapter. All of your storage must be virtualized and shared between the VIO servers. Both machines must be on the same subnet and share the same HMC. You also must be running on the new POWER6 hardware with a supported operating system. 

Once you get everything set up, and hit the button to move the partition, it all goes pretty quickly. Since it moves a ton of data over the network (the running partition’s memory state has to be copied from one frame to the other), they suggest that you be running on Gigabit Ethernet and not 100 Megabit Ethernet.
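For anyone who prefers the HMC command line to the GUI button, the same validate-then-migrate flow can be driven with the HMC’s migrlpar and lslparmigr commands. This is just a sketch: the managed-system names (p570-source, p570-target) and the partition name (prodlpar) are placeholders I made up, and the exact options available depend on your HMC level.

```shell
# Validate first: checks RMC connectivity, virtual adapter mappings,
# and capacity on the target frame without moving anything.
migrlpar -o v -m p570-source -t p570-target -p prodlpar

# If validation comes back clean, kick off the live migration.
migrlpar -o m -m p570-source -t p570-target -p prodlpar

# From another session, watch the migration state of partitions
# on the source frame while the move is in flight.
lslparmigr -r lpar -m p570-source
```

Running the validate operation first is worth the extra step: it surfaces missing prerequisites (a non-virtualized adapter, a LUN not mapped to both VIO servers) before you commit to the move.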

I can think of a few scenarios where this capability would be useful:

The next time errpt shows me a sysplanar error and support confirms that we have to replace a part (which usually requires a system power down), I just schedule the CE to come do the work during the day. Assuming I have my virtualization in place and a suitable machine to move my workload to, I just move my partition over to the other hardware while the repair is being carried out. No calling around the business asking for maintenance windows. No doing repairs at 1 a.m. on a Sunday. We can now do the work whenever we want, as the business will see no disruption at all.

Maybe I can run my workload just fine for most of the time on a smaller machine, but at certain times (e.g., month end), I would rather run the application on a faster processor or a beefier machine that’s sitting in the computer room. Move the partition over to finish running a large month-end job, then move it back when the processing completes.

Maybe it’s time to upgrade your hardware. Bring in your new machine, set up your VIO server, move the partition to your new hardware and decommission your old hardware. Your business won’t even know what happened, but will wonder why the response time is so much better.

What happens if you’re trying to move a partition and your target machine blows up? If the workload hasn’t completely moved, the operation aborts and you continue running on your source machine.

This technology isn’t a substitute for High Availability Cluster Multi-Processing (HACMP) or any kind of disaster-recovery situation. This entire operation assumes both machines are up and running, and resources are available on your target machine to handle your partition’s needs. Planning will be required.

I know I haven’t thought of everything. Let me know what scenarios you come up with for this useful tool.