Managing Servers a Unique Challenge for Small Shops

Edit: The first two links do not work. The Netview link no longer works. The first ganglia link no longer works. I still called it an AS/400.

Originally posted April 27, 2010 on AIXchange

It seems like every customer I talk to has a different method for managing their servers.

For a large data center, the challenges are apparent. There are hundreds, even thousands of servers. Some are standalone servers, some have virtual I/O servers with many client LPARs. As the number of servers grows, getting a handle on these environments can be difficult.

However, smaller shops face their own issues. Many manage their servers without the benefit of standard management tools. If you’re in this situation, you should be aware of some of the available options. For starters, built-in tools like syslog and errpt can alert us when problems occur.

We can also roll our own scripts, parse our own logs and manage our own machines without help from anyone outside our organization, assuming, of course, that we have the time to work on our scripts.
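To give a feel for what a home-grown script might look like, here is a minimal sketch in Python that reads the summary listing from errpt and picks out the entries most likely to need a quick response. It assumes the default six-column summary layout (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION) and treats permanent hardware errors (type P, class H) as the urgent ones; that rule is my own assumption, not a standard. The function names and the sample text are made up for illustration.

```python
from typing import List, NamedTuple


class ErrptEntry(NamedTuple):
    identifier: str
    timestamp: str      # MMDDhhmmYY in the default summary listing
    err_type: str       # e.g. P (permanent), T (temporary), I (informational)
    err_class: str      # e.g. H (hardware), S (software), O (operator)
    resource: str
    description: str


def parse_errpt(text: str) -> List[ErrptEntry]:
    """Parse errpt summary output into entries, skipping the header line."""
    entries = []
    for line in text.splitlines():
        # Split on whitespace at most 5 times so the description stays whole.
        fields = line.strip().split(None, 5)
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue
        entries.append(ErrptEntry(*fields))
    return entries


def needs_attention(entries: List[ErrptEntry]) -> List[ErrptEntry]:
    """Assumed rule: permanent hardware errors deserve an immediate look."""
    return [e for e in entries if e.err_type == "P" and e.err_class == "H"]


# Hypothetical sample text, shaped like errpt summary output.
SAMPLE = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A6DF45AA   0427101010 I O RMCdaemon      The daemon is started.
BFE4C025   0427093010 P H sysplanar0     UNDETERMINED ERROR
"""

if __name__ == "__main__":
    for entry in needs_attention(parse_errpt(SAMPLE)):
        print(entry.timestamp, entry.resource, entry.description)
```

On a real system the text would come from running errpt (for example through subprocess.run) rather than from a sample string, and the matching entries could be mailed or paged out instead of printed.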

However, many organizations lack the time or skills to create their own tools and would rather purchase software to help them with this task. Certainly IBM Systems Director comes to mind, but recently I was asked about other monitoring tools.

As I’m focused heavily on AIX running on POWER servers, my responses were confined to that platform. I immediately thought of Tivoli software, as I had administered Tivoli NetView once upon a time.

According to the Web site, “this system monitoring software (can) manage operating systems, databases and servers in distributed and host environments. It provides a common, flexible and easy-to-use browser interface and customizable workspaces to facilitate system monitoring. It detects and recovers potential problems in essential system resources automatically. It offers lightweight and scalable architecture, with support for IBM AIX, Solaris, Windows, Linux and IBM System z monitoring software.”

I’ve also read about software called Ganglia. I’ve even seen it in action. Though its creators tout it as “a scalable distributed monitoring system for high-performance computing systems such as clusters and grids,” it’s capable of monitoring performance across POWER machines.

Beyond that, though, I drew a blank. What other toolsets are out there? What are we relying upon to manage and monitor our systems?

Back when I worked on the AS/400 system, I loved the Robot/Alert and Robot/Console products.

Hopefully I’m not misremembering, but I seem to recall being able to automate the answering of console messages and redirect operator messages to an alphanumeric pager. Back in the early ’90s, this was a handy way to have my machine page me and tell me what was wrong. With a quick glance at my pager, I knew whether the issue required an immediate response or whether it could wait a bit. I’m sure the current iteration of the product offers many more powerful features that I’m not aware of.

What do you think? What’s the dominant monitoring software package for Power Systems? What are you using? Send me an e-mail or leave a comment. While you’re at it, are there tools you tried and didn’t like? Or, if you had a wish list, what features would you like to see included in monitoring software?