Edit: Are you tracking your system performance?
Originally posted December 2017 by IBM Systems Magazine
At the most recent IBM Technical University event in New Orleans, I was talking with Randy Watson of Midrange Performance Group (MPG). He mentioned that many customers don’t keep any performance data whatsoever.
Randy’s words surprised me. Everyone should carefully track system performance. To say it’s worth the cost and effort is an understatement. This data provides a variety of important benefits.
Three Scenarios
Consider this scenario: Your phone rings in the middle of the night. You’re told that users are having issues with one of your systems. Once you clear your head, you bring up the performance graphs that display your LPAR’s historical data. You can immediately see where and when things changed. Now you’re on track to determine what’s causing the issues you’re seeing.
Performance graphs provide an easy way to compare how your environment looks when it's running normally with how it looks when there’s a problem. Of course, in real life things are seldom cut and dried. For instance, subtle changes may require you to go back over a longer period of time to find something.
While graphs can prod things in the right direction, they do have their limits. Sometimes graphing your data can hide or at least distort the truth, a point AIX performance expert Earl Jew makes in his articles and lectures. Much of what you see when interpreting graphs will depend on the specific items you’re looking at and the length of your intervals.
Still, performance graphs are typically helpful in these situations. If someone tells you that performance is degraded, you need data. How can you begin to understand the impact of a change to your environment if you have no idea what normal looks like?
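To make the idea of "knowing what normal looks like" concrete, here is a minimal sketch that compares a new reading against a baseline built from historical samples. The sample values, the `is_abnormal` name and the three-standard-deviation threshold are all assumptions chosen for illustration; a real baseline would account for time of day, batch windows and so on.

```python
from statistics import mean, stdev


def is_abnormal(baseline, current, threshold=3.0):
    """Return True if `current` lies more than `threshold` standard
    deviations away from the mean of the baseline samples."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs >= 2 samples
    if sigma == 0:
        # A perfectly flat baseline: any difference counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical CPU-utilization percentages collected at the same hour
# on seven "normal" days (illustrative values, not real measurements).
baseline_cpu = [41.0, 39.5, 42.3, 40.1, 38.7, 41.8, 40.6]

print(is_abnormal(baseline_cpu, 43.0))  # a reading close to the usual range
print(is_abnormal(baseline_cpu, 85.0))  # a reading far outside it
```

Without the historical samples there is nothing to pass as `baseline`, which is the whole point: the comparison is only possible if the data was collected beforehand.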
Historical data is also useful in areas beyond performance. Here’s another scenario: Management tells you that it’s time to migrate to POWER8 servers, and they want to know what models and hardware components you recommend for the refresh. Naturally, you’re not going to guess. You’ll check the aggregated historical performance data that encompasses all of the LPARs in your environment. You’ll project what workloads can be expected to do over the expected lifespan of the new hardware and estimate the performance gains the new hardware will bring. You’ll give management every reason to take your thoroughly researched recommendations seriously.
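As a rough illustration of the projection step, the sketch below fits a straight line to historical peak usage and extends it over the lifespan of the new hardware. The monthly figures, the function names and the assumption of linear growth are all hypothetical; real sizing work would also factor in per-core performance differences between processor generations and would rarely rely on a straight line alone.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = slope * x + intercept,
    where x is simply the sample index 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def project(ys, periods_ahead):
    """Extrapolate the fitted line `periods_ahead` periods past the last sample."""
    slope, intercept = fit_line(ys)
    return slope * (len(ys) - 1 + periods_ahead) + intercept


# Hypothetical monthly peak core consumption summed across all LPARs.
monthly_peak_cores = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5]

# Projected peak three years (36 months) after the last sample.
print(project(monthly_peak_cores, 36))
```

The longer the history fed into `fit_line`, the less a single unusual month skews the trend, which is one more reason to start collecting early.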
And now for one more scenario that highlights one more benefit of consulting performance data: Your enterprise is looking to add new workload to the physical hardware and is considering consolidating some other workloads from other data centers. Where is the best place for this new workload to land? Can existing servers handle additional memory or CPU, or should new adapters be brought in? Is ordering all new server hardware the right answer?
Performance Monitoring Tools
Various products, some fee-based and others free, can be deployed to monitor performance and alert you to potential problems. Tools can certainly save you the effort of looking up rperf numbers, creating spreadsheets and guessing.
Cluster management
* Nutanix running on Power is an offering for customers that use hyperconverged servers, which debuted earlier this year. Nutanix allows you to run capacity planning reports directly from Prism. The reports, which include graphs and charts, show your capacity usage and projected growth requirements, and are designed to help you manage your cluster resources.
* Ganglia is a cluster monitoring tool that’s designed for high-performance computing (HPC) environments and is available for AIX.
https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/Power+Systems/page/Ganglia
Scripts and software
* nmonchart is a Korn shell script for AIX or Linux. It converts files collected by nmon to HTML and displays more than 50 AIX and Linux performance graphs and configuration details.
http://nmon.sourceforge.net/pmwiki.php?n=Site.Nmonchart
* lpar2rrd is free software:
The tool offers you end-to-end views of your server environment and can save you significant money in operation monitoring and by predicting utilization bottlenecks in your virtualized environment. You can also generate policy-based alerts, capacity reports and forecasting data. The tool supports IBM Power Systems and VMware virtualization platforms. It is agentless (it receives everything from management stations such as vCenter or the HMC). The collected data set can be extended with data provided by the OS agents or NMON files.
Vendor Tools
Note that I don’t endorse any of the products listed here, but these commercial solutions are certainly worthy of your consideration.
* Galileo Performance Explorer
* Midrange Performance Group Performance Navigator
In addition, IBM has its performance management product, as well as PowerVP. You could even just activate topas or nmon recording on each of your LPARs.
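For the nmon route, recording is typically started with the `-f` flag (write to a file), `-s` (seconds between snapshots) and `-c` (number of snapshots). The small sketch below only works out the snapshot count for a given interval and duration and builds the command line; the `nmon_recording_args` helper is a hypothetical name, and the 5-minute interval and 24-hour duration are assumptions, not a recommendation from the article.

```python
def nmon_recording_args(interval_seconds=300, hours=24):
    """Build an nmon command line that records one snapshot every
    `interval_seconds` for `hours` hours."""
    total_seconds = hours * 3600
    if interval_seconds <= 0 or total_seconds % interval_seconds != 0:
        raise ValueError("interval must be positive and divide the duration evenly")
    count = total_seconds // interval_seconds
    return ["nmon", "-f", "-s", str(interval_seconds), "-c", str(count)]


# Print the command rather than running it, so this works on any machine.
print(" ".join(nmon_recording_args()))
```

Scheduling such a command once a day (from cron, for example) on each LPAR gives you one recording file per day that tools like nmonchart can turn into graphs.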
Setting a Course of Action
Once you choose a product or tool, you then need to decide what you want to accomplish. Is your focus going to be performance monitoring or capacity planning? Are you most interested in graphs and dashboards? Do you want to see trends?
During my discussion with Randy Watson, he mentioned that ultimately, most customers will use performance data either to conduct some kind of server sizing or to implement workload consolidation (scenarios 2 and 3 from earlier). You’ll need to collect data for a reasonable amount of time in order to make any useful projections, so the sizing process in particular can take a while if you haven’t previously collected data.
According to Randy, the amount of data that MPG’s product generates on local disk varies with the number of LUNs in the environment, but with 5-minute intervals, 1-5 MB per day can be expected. He said that MPG tries to manage the size of its customers’ historical consolidated files by keeping only 90 days of disk data. In addition, they trim about 20 percent from the daily file size by removing redundant data (e.g., configuration data that doesn’t change). By default, a year’s worth of data is kept. For most customers, that amounts to less than 1 GB of data for that consolidated file.
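A back-of-the-envelope check of those figures can be sketched as follows. The `yearly_size_mb` function is a hypothetical helper, and the model is deliberately naive: it applies the roughly 20 percent trim to every day and ignores the 90-day limit on disk data, so it overstates the size for environments at the high end of the daily range.

```python
def yearly_size_mb(daily_mb, days=365, trimmed_fraction=0.20):
    """Estimate the size of one year of daily files after removing
    the redundant share (`trimmed_fraction`) from each day."""
    return daily_mb * (1 - trimmed_fraction) * days


# The article quotes 1-5 MB per day at 5-minute intervals.
for daily_mb in (1, 5):
    print(daily_mb, "MB/day ->", round(yearly_size_mb(daily_mb)), "MB/year")
```

At the low end this comes to a few hundred megabytes per year. The high end lands above 1 GB in this simplified model, which shows why pruning older disk data matters for keeping the consolidated file small.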
I’m sure other vendors take similar approaches to keep a handle on the amount of data being collected. Then again, with the large disk sizes that are available now, spending a reasonable amount of capacity on historical performance data shouldn’t break anyone’s budget.
Maintaining uptime is important, as is planning for the future. Not only do you need to keep your servers running, you must proactively ready them for what lies ahead. Are you doing your part?