Edit: I know we all love continuous uptime, but the time to find out your machine will not boot is during a planned outage, not an unplanned outage.
Originally posted June 8, 2010 on AIXchange
A customer recently performed some scheduled maintenance on a critical server that had an uptime of nearly two years. The customer had created some great scripts that would bring down the application and then connect via ssh to the database server to bring down the database. The application start scripts worked the same way: they’d remotely connect to the database server and bring up the database during the application startup process.
After successfully completing the server maintenance, it was time to bring the application back up. The customer ran the application startup script, but the application didn’t appear to be working properly. After some phone calls to application and database support personnel, it turned out that someone had commented out a line in the startup script. That line was the command that would ssh to the database server to start the database, and the application relied on the database in order to work properly.
I’ve said it before: When making changes to a machine, the changes must be tested. In this case, the timestamp on the changed file was nearly two years old. So the change was made, never tested, and then forgotten. It may have been as simple as someone testing another part of the startup script without wanting it to contact the database server, then forgetting to uncomment the line once that testing was done. Because the timestamp was so old, the file wasn’t a smoking gun. It didn’t stand out during troubleshooting, so it took a while for anyone to actually check whether the script did what it was supposed to do. People assumed that such an old startup script had not been changed and that, since it would have been used at some point over the last few years, it should still be working.
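One way to catch this kind of leftover before a maintenance window is to scan startup scripts for commented-out commands. What follows is only a minimal sketch, not the customer’s actual tooling: the pattern, the function name find_commented_ssh, and the sample script with its host name and paths are all made up for illustration, and a real check would need to match whatever commands your own scripts depend on.

```python
import re

# Assumption: a "commented-out ssh command" is a line whose first
# non-blank characters are one or more '#' followed by the word ssh.
COMMENTED_SSH = re.compile(r"^\s*#+\s*ssh\s+\S+")


def find_commented_ssh(script_text):
    """Return (line_number, line) pairs for commented-out ssh commands."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if COMMENTED_SSH.match(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical startup script; the host name and paths are invented.
sample = """#!/bin/ksh
# Start the database first, then the application
#ssh dbhost /opt/db/bin/start_db.sh
/opt/app/bin/start_app.sh
"""

for number, line in find_commented_ssh(sample):
    print(f"line {number}: {line}")
```

A check like this only flags candidates for a human to review. It is no substitute for actually running the stop and start scripts.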
Although none of us like downtime, especially with resilient servers that “just run,” maintenance windows and application restarts are well worth doing. If we don’t regularly exercise our server shutdowns and startups, we may not uncover a script problem or some other issue until long after the change is made. By scheduling reboots each month or each quarter, we can detect and deal with these changes much sooner.
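As a rough illustration of a monthly or quarterly restart policy, the sketch below flags a machine that has gone too long without a reboot. The function name restart_overdue, the 90-day default, and the boot date are assumptions chosen for the example. How you obtain the last boot time depends on your environment and is left out here.

```python
from datetime import datetime, timedelta


def restart_overdue(last_boot, now, max_days=90):
    """Return True if the machine has run longer than max_days without a restart."""
    return now - last_boot > timedelta(days=max_days)


# Hypothetical dates: a server last booted almost two years before the check.
last_boot = datetime(2008, 7, 1)
now = datetime(2010, 6, 8)
print(restart_overdue(last_boot, now))
```

Passing max_days=30 would model a monthly schedule instead of a quarterly one.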
The same holds true with IBM PowerHA clusters. I always like to know that failover is being regularly tested. The wrong time to find out that something doesn’t work is when the application actually needs to fail over.
Having machines that can stay up for years is a tremendous thing. But there’s nothing like the peace of mind that comes from knowing your machines stop and start the way they’re supposed to.