System Monitoring Shouldn’t Be Neglected

Edit: I still run into this phenomenon, and it still surprises me.

Originally posted July 15, 2014 on AIXchange

What are you doing to monitor your systems from both the hardware and OS levels? Are you using a commercial product? Are you using an open source product? Are you using hand-built scripts that run from cron? Are you using anything?

Have you logged into your HMC lately? Does anything other than green appear in the system status, attention LEDs or Serviceable Events sections of the display? Countless times I’ve seen machines where the HMC messages were being ignored. Is your HMC set up to contact IBM when your servers run into any issues?

When your machines have issues, are you deluged with alerts? One customer I know of had a script that monitored their machine and sent emails when errors were detected. During one event, the volume of errors being generated, combined with the way the script was written, made the node unresponsive, and PowerHA actually failed over. This forced the customer to go into the mail queue and clean up a huge number of unsent messages. Then they had to go into the email client and clean up all of the messages they’d received. Finally, they had to schedule downtime to fail the application back to the node it was supposed to be running on.
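One way to keep a hand-built script from burying you in mail is to throttle it: send the first alert for a given error, then hold back repeats until a cooldown has passed. Here is a minimal Python sketch of that idea. The class name, the error keys and the 15-minute default are all my own assumptions for illustration, not the customer’s script or any product’s API.

```python
import time
from collections import defaultdict


class AlertThrottle:
    """Allow at most one notification per error key per cooldown period.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, cooldown_seconds=900, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.last_sent = {}                 # key -> time of last notification
        self.suppressed = defaultdict(int)  # key -> repeats held back

    def should_send(self, key):
        """Return True if an alert for this key should go out now."""
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            # Still inside the cooldown: count the repeat, send nothing.
            self.suppressed[key] += 1
            return False
        self.last_sent[key] = now
        return True

    def drain_suppressed(self, key):
        """Return and reset the number of alerts held back for this key."""
        return self.suppressed.pop(key, 0)
```

The monitoring script would call should_send() before mailing anything, and could include the drain_suppressed() count in the next message so you still know how noisy the error was.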

I know of multiple customers that simply route error messages to a mail folder — and then never bother checking them. What’s the point of monitoring a system if you never analyze the information you collect?

How diligent are you about deactivating monitoring during periods of scheduled maintenance? In many organizations where a help desk monitors systems, cycles are wasted because techs are so often called to follow up on alerts and error messages triggered by scheduled events.
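If your alerts pass through a script before they reach the help desk, suppressing them during scheduled maintenance can be as simple as checking each alert’s timestamp against a list of windows. The sketch below assumes alerts are dictionaries with a "time" field and that the windows are kept as (start, end) pairs. Both are hypothetical choices for illustration; commercial tools usually have their own way of doing this.

```python
from datetime import datetime


def in_maintenance(now, windows):
    """Return True if `now` falls inside any (start, end) window."""
    return any(start <= now < end for start, end in windows)


def filter_alerts(alerts, windows):
    """Keep only alerts raised outside every maintenance window."""
    return [a for a in alerts if not in_maintenance(a["time"], windows)]


# Example data (made up): one two-hour window and two alerts.
windows = [(datetime(2014, 7, 19, 22, 0), datetime(2014, 7, 20, 0, 0))]
alerts = [
    {"time": datetime(2014, 7, 19, 23, 15), "msg": "node down"},  # planned
    {"time": datetime(2014, 7, 20, 3, 30), "msg": "node down"},   # real
]
to_page = filter_alerts(alerts, windows)
```

Only the 3:30 alert ends up in to_page. The hard part is not the code but remembering to record the window before the work starts.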

Of course there are other impacts that can result from neglecting systems. If internal disks are going bad, and you’re not monitoring and fixing them, eventually you will lose your VIOS rootvg (assuming that’s how you have it set up). And just as some customers will ignore the system monitoring messages they collect, other customers don’t take action on hardware events that are being logged. Having robust hardware that notifies you when it needs maintenance is only useful if you actually heed the notifications.

Deploying your OS and installing your application is relatively simple, but we also have to decide how we will manage and maintain these systems once they are in production, and then actually do it. Sure, everyone is busy, and some tools cost money, but try explaining that to someone who cares when production goes down.

On a totally unrelated topic, I want to acknowledge that AIXchange is having a birthday. Seven years ago this week — July 16, 2007 — the first article was posted on this blog. Many thanks to everyone who takes the time to read this blog, and special thanks to those who have suggested topics. I welcome your input, and it does make a difference.

Here’s to the next seven years.