Edit: Monitoring goes a long way. Some links no longer work.
Originally posted July 22, 2014 on AIXchange
A customer planned to use Live Partition Mobility (LPM) to move running workloads from frame 2 to frame 1. The steps were: shut down frame 1, physically move it, recable it and power it back on, then use LPM to bring the workload from frame 2 to frame 1, and finally repeat the process to physically move frame 2.
The task at hand was simple enough, but there was a problem. The physical server that was being moved had been up for 850 days. Do not make the mistake of moving a machine that’s been running continuously for more than two years without first logging in and checking on the server’s health. Furthermore, make sure you’ve set up alerting and monitoring for your servers.
I got a call after step one of the customer’s plan was complete and the damage had been done. Nonetheless, much can be learned from this episode.
Was errpt showing tons of unread errors? Yes. Had the error log been looked at? No. Had someone cleared the error log before support got involved with the issue? Yes. Was support still able to help? Yes. When you send a snapshot to IBM support, they can access the error log even if it’s been cleared from the command line, assuming those errors have not been overwritten in the actual error log file in the meantime.
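A quick way to keep an eye on the error log is to count the entries that matter on a schedule. The following is only a minimal sketch, not the procedure used at this customer: permanent_hw_errors is a hypothetical helper name, and it assumes the standard errpt summary layout in which the third column is the error type (P for permanent) and the fourth is the error class (H for hardware).

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): count permanent
# hardware errors in errpt summary output read from standard input.
# Assumes the summary columns are:
#   IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
# where T is the error type and C is the error class.
permanent_hw_errors() {
    awk '
        # Skip the header line, then match type P (permanent), class H (hardware).
        NR > 1 && $3 == "P" && $4 == "H" { n++ }
        # n + 0 prints 0 rather than an empty string when nothing matched.
        END { print n + 0 }
    '
}

# Typical use on an AIX or VIO server (run from cron, mail on non-zero):
#   errpt | permanent_hw_errors
```

Reading from standard input keeps the function independent of errpt itself, so it can be tried against saved output before being wired into any alerting.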
Were there full filesystems? Yes. In this case one of the culprits was the /var/opt/tivoli/ep/runtime/nonstop/bin/cas_src.sh script, which wrote a file (/dev/null 2>&1) that filled up the / filesystem.
To make matters worse, the machines were part of a shared storage pool (SSP), and after the physical move frame 1 would not rejoin the SSP cluster. This left only two of four VIO servers as part of the SSP.
It turned out that after the physical move, the network ports weren’t working. As a result, multicast wasn’t working. At least getting multicast back up was easy enough. However, the two VIO servers were still unable to join the cluster, and the third VIO server on frame 2 (vio3) had protected itself by placing rootvg in read-only mode as it logged physical disk errors. So out of a cluster of four VIO servers, only one was actually functional, and that one had its own issues. If things weren’t fixed quickly, production would be impacted.
The problem with the one operable VIO server was that, because it had switched to read-only, SSP errors occurred whenever someone tried to start or stop any of the cluster nodes. In other words, it was keeping the cluster in a locked state:
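Full filesystems are just as easy to catch ahead of time. Here is a rough sketch, assuming POSIX-format df -P output (capacity in the fifth column, mount point in the sixth) and mount points without spaces. The function name fs_over_threshold and the default threshold of 90 percent are illustrative choices, not anything from the customer's environment.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): list filesystems whose
# usage is at or above a threshold. Reads `df -P` output on standard input.
# Assumes mount points contain no spaces.
fs_over_threshold() {
    threshold="${1:-90}"   # percent used; defaults to 90 if no argument is given
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the df header line
            pct = $5
            sub(/%/, "", pct)         # strip the trailing percent sign
            if (pct + 0 >= limit) print $6 " " pct "%"
        }
    '
}

# Typical use (run from cron, alert if anything is printed):
#   df -P | fs_over_threshold 90
```

Each line of output is a mount point followed by its usage, so an empty result means nothing crossed the threshold.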
clstartstop -start -n clustername -m vio3
cluster_utils.c get_cluster_lock 6096 Could not get lock: 2
clmain.c cl_startstop 3030 Could not get clusterwide lock.
Fortunately, rebooting the third VIO server cleared up this issue. And with that, the other VIO servers came back into the SSP cluster. Ultimately, the customer was able to use LPM to move clients to frame 1, which had already been physically moved. This allowed the customer to then shut down frame 2 and physically move it as well.
So what have we learned? Check your error logs. Check your filesystems. Schedule occasional reboots of your machines. Make sure you’re applying patches to your VIO servers and LPARs. Make sure you have good backups.
Finally, note that in this instance, having the capability to perform LPM operations really made a huge difference. Despite the severity of these problems, the users of these systems had no idea that anything had been going on at all.