Edit: Still important to consider.
Originally posted February 19, 2013 on AIXchange
Do you update your systems? Do you patch your machines monthly? Quarterly? Annually? Do you ever patch?
Are change windows built into your environment (e.g., scheduled system maintenance on the third Sunday of each month)? Is it too difficult to get the various application owners to agree to a set downtime because you have so many different LPARs running on your physical frame? Is downtime simply not allowed in your environment?
Over the years I’ve met a number of people who live by the “if it ain’t broke, don’t fix it” adage. What’s funny is oftentimes the older a system gets, the more reluctant customers are to maintain it. Logically these systems have a greater need for attention than something just out of the box. Of course we’ve all used, seen or at least heard about systems that just kept running. Recently I saw F40s that are still in production, still running AIX 4.3 and still chugging along. And sure, they can keep going for a long time to come. We are fortunate enough to work with incredibly powerful and well-built hardware.
But just think about an older system — not only the hardware that’s running old microcode, but the HMC that’s running old code, the operating system that hasn’t been patched and the application that hasn’t been updated. Even if the machine isn’t visible to the Internet, there’s still great potential for things to go wrong. And if something does go wrong, how would you respond?
Customers in this situation know they’re on their own, and they’re OK with it. Typically I’m told that the application vendor is no longer in business, so they can’t get support for that code anyway. If their hardware dies, they hope they can find someone who can help them — someone who’s familiar with the limitations of older OS versions. They hope they can still get parts for their old hardware. (Along those lines, I know of folks who buy up duplicate servers just so they can have parts available to swap out. I just hope that these customers realize that tearing out part of an old machine and successfully putting it into another old machine is a unique skill.)
So I’ve heard it all, but I’ll never truly understand people who would take these chances. Why rely on hope? There are alternatives — alternatives that don’t involve buying all new systems.
For instance, if you’re running AIX 5.2 or 5.3, you can move to newer POWER7 hardware by using versioned WPARs. A versioned WPAR runs your AIX 5.2 or 5.3 environment inside a workload partition hosted on a newer, supported version of the operating system, which in turn provides you with some limited support options.
Many of us who’ve called IBM Support learned that our issue was a known problem that was addressed with an operating system fixpack or firmware update. That’s the advantage of paying for regular maintenance. Updates to your machines and operating systems take care of the known issues.
Of course some will then argue that making these types of changes could introduce new bugs or issues that would have been avoided by not fixing what wasn’t broken. My response to this argument is that test and QA systems are really important. Implement your changes on these boxes first; then move them into production.
Two methods to consider for hardware maintenance are Live Partition Mobility (LPM) and PowerHA. With LPM you can evacuate running LPARs onto other hardware with no downtime, conduct maintenance on your source hardware and then move the LPARs back to the original hardware. Using PowerHA you can move your resource group to a standby node, conduct maintenance on your original node and then move your resource group back. In this case the application takes a short outage to restart each time the resource group moves, but PowerHA is much faster than some alternatives.
(Note: Whether or not you’re doing maintenance, periodically moving your resource groups around in a PowerHA cluster is a good idea. By doing this you can make sure that the failover actually works, and that no changes made on node A went unreplicated on node B.)
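One kind of drift between cluster nodes is easy to check for ahead of a failover test: installed fileset levels. Here is a minimal Python sketch of that comparison. It assumes you've already collected a fileset-to-level mapping from each node by whatever means you prefer; the function name and the sample levels are hypothetical, and this isn't a PowerHA feature, just a plain dictionary comparison.

```python
from typing import Dict, List, Tuple


def fileset_drift(node_a: Dict[str, str],
                  node_b: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (fileset, level_on_a, level_on_b) for every mismatch.

    A "-" means the fileset is missing on that node.
    """
    drift = []
    for name in sorted(set(node_a) | set(node_b)):
        level_a = node_a.get(name, "-")
        level_b = node_b.get(name, "-")
        if level_a != level_b:
            drift.append((name, level_a, level_b))
    return drift


# Illustrative data only; real levels would come from each node's inventory.
node_a = {"bos.rte": "7.1.2.0", "bos.net.tcp.client": "7.1.2.15"}
node_b = {"bos.rte": "7.1.2.0", "bos.net.tcp.client": "7.1.2.0"}

for name, level_a, level_b in fileset_drift(node_a, node_b):
    print(f"{name}: node A has {level_a}, node B has {level_b}")
```

An empty result doesn't prove the nodes match in every respect (users, filesystems and application configs can drift too), which is why actually moving the resource group is still the real test.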
For OS upgrades you might use alt_disk_copy or multibos to make a copy of your rootvg volume group and update that copy. You can boot from that copy after the update, and if anything goes wrong, you can quickly change your boot list and return to the original boot disk. This simplifies your backout process if you need to go back for any reason.
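To make the clone-update-backout sequence concrete, here is a rough Python sketch that only builds the command lines as a dry run; it doesn't execute anything. The disk names and the update directory are assumptions for illustration, and the `-b update_all -l` form of `alt_disk_copy` is the way I'd expect to apply updates to the clone, so check the man pages on your own AIX level before relying on any of it.

```python
from typing import List


def clone_and_update_plan(clone_disk: str, update_dir: str) -> List[List[str]]:
    """Commands to clone rootvg onto a spare disk and update the clone."""
    return [
        # Clone rootvg to clone_disk, then install updates from update_dir
        # into the clone (flags are an assumption; verify on your system).
        ["alt_disk_copy", "-d", clone_disk, "-b", "update_all", "-l", update_dir],
        # Display the normal-mode boot list to confirm which disk boots next.
        ["bootlist", "-m", "normal", "-o"],
    ]


def backout_plan(original_disk: str) -> List[List[str]]:
    """Commands to point the boot list back at the original disk and reboot."""
    return [
        ["bootlist", "-m", "normal", original_disk],
        ["shutdown", "-Fr"],
    ]


# Hypothetical names: hdisk0 holds the current rootvg, hdisk1 is the spare.
for cmd in clone_and_update_plan("hdisk1", "/updates") + backout_plan("hdisk0"):
    print(" ".join(cmd))
```

Keeping the backout as its own short list is the point: if the updated copy misbehaves, going back is a boot list change and a reboot, not a restore.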
So where do you stand on patching? Let me know in the comments.