Edit: This can be less of an issue when things are more automated, but it is still worth consideration.
Server build standards simplify the process of supporting IT environments.
Originally posted December 2006 by IBM Systems Magazine
Note: This is the first of a two-part article series. The second part will appear in the January EXTRA.
There are still small organizations with only one or two full-time IT professionals. They may be able to make things work with a minimum of documentation or procedures; their environment may be small enough that they can keep it all in their heads. As they continue to grow, however, they may find that formal processes help both them and the additional staff they bring on board. Eventually, they may grow to a point where this documentation is a must.
The other day I was shutting down a logical partition (LPAR) that a co-worker had created on a POWER5 machine. A member of the application support team had requested we shut down the LPAR because some changes had been made, and they wanted to verify everything would come up cleanly and automatically after a reboot. We decided to take advantage of the outage to change a setting in the profile and restart it. To our surprise, after the LPAR finished its shutdown, the whole frame powered off. When you go into the HMC, right-click on the managed system and select Properties, you see the managed system properties dialog. On the General tab, there's a checkbox that tells the machine to power off the frame after all logical partitions are powered off. During the initial build, this setting was selected, and our quick reboot turned into a much longer affair, as the whole frame had to power back up before we could activate our partition. This setting had not been communicated to anyone, and we had mistakenly assumed the machine was set up like the others in our environment.
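If your HMC has remote command-line access enabled, you can also audit this kind of setting without clicking through the GUI. As a rough illustration (the managed system name here is a placeholder, and the attribute name should be confirmed against your HMC level), something like the following shows and clears the power-off behavior:

    # Show whether the frame powers off once the last partition is down
    lssyscfg -r sys -m MY_P5_FRAME -F name,power_off_policy

    # 0 = leave the frame powered on after the last LPAR shuts down
    chsyscfg -r sys -m MY_P5_FRAME -i "power_off_policy=0"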
This scenario could have been avoided had there been good server build standards in place, along with a mechanism to enforce those standards. Our problem wasn't that the option was selected, but that there was no good documentation specifying exactly what each setting should be and why. Someone saw a setting, made their best guess at what it should be, and that decision was never communicated to the rest of the team. One of the problems with having a large team is that people can make decisions like these without letting others know what has taken place. Unless they tell other people what they're doing, other members of the team might assume the machine will behave one way when, in actuality, it's been set up another way.
Making a List and Checking It Twice
Checklists and documentation are great, as long as people actually do what they're supposed to do. Some shops have a senior administrator write the checklist and a junior administrator build the machine, while another verifies the build was done correctly. A problem can crop up when a senior administrator asks for something in the checklist without explaining the thinking behind it. He understands why he has asked for some setting to be made, or some step to be taken, but nobody else knows why it's there. The documentation should include not only what needs to change, but also why it needs to be changed. If it's clear why changes are made, people are more apt to follow through and make sure all the servers are consistent throughout the environment. If the answer they get is "just do it," they might be less likely to bother with it, since they don't understand it anyway. The person actually building the machine might not think a step is important enough to follow through on, and the team ends up thinking a server was built one way when the finished product doesn't actually look the way the team as a whole expected.
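As a purely illustrative example of a checklist entry that captures both the "what" and the "why," using the frame power-off setting from earlier:

    Item:    Un-check "Power off the system after all the logical partitions are powered off"
    Where:   HMC -> managed system -> Properties -> General tab
    Why:     Shutting down the last LPAR should never take the whole frame down with it
    Verify:  lssyscfg -r sys -m <frame> -F power_off_policy   (expect 0)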
The team also needs to keep on top of the checklist, as it's a living document in a constant state of flux. As time goes on, if the checklist isn't kept up to date, changes to the operating system and maintenance-level patches can make a setting obsolete, or the setting can start causing problems instead of fixing them. The decision could have been made to deploy new tools, to change where logfiles go, or to change which standard jobs run out of cron. If these changes aren't continually added to the checklist, new server builds no longer match those in production. This is equally important when decommissioning a machine. There are steps that must be taken and other groups that need to be notified. The network team might need to reclaim network cables and ports. The SAN team may need to reclaim disk and fiber cables. The facilities team may need to know that power is no longer required on the raised floor. To put it simply: a checklist that's followed can ensure these steps get completed. Some smaller shops may not have dedicated teams for these things, in which case it might just be a matter of reminding the administrators that they need to take care of these steps.
Another issue can crop up when the verifier is catching problems with new server builds but isn't updating the documentation to clarify the settings that need to be made. If the verifier consistently sees people forgetting to change a setting, they should communicate to the whole team what's happening and why it needs to happen during the server build, and then update the documentation to explain more clearly what needs to be done during the initial build. What's the point of a verifier catching problems all the time if the documentation is never updated to keep those problems from cropping up in the future?
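To make this concrete, here's a minimal sketch of what an automated verifier might look like. The specific checks, file paths and the log_cleanup job are placeholders for whatever your own checklist documents, not recommendations:

    #!/usr/bin/ksh
    # verify_build.ksh - sketch: compare a new build against a few documented standards
    rc=0

    # Example check: the documented tunables file should exist and not be empty
    if [ ! -s /etc/tunables/nextboot ]; then
        print "FAIL: /etc/tunables/nextboot is missing or empty"
        rc=1
    fi

    # Example check: the standard housekeeping job should be in root's crontab
    if ! crontab -l | grep -q "/usr/local/bin/log_cleanup"; then
        print "FAIL: standard log_cleanup cron job not found in root's crontab"
        rc=1
    fi

    [ $rc -eq 0 ] && print "PASS: no deviations found"
    exit $rc

A script like this is only as good as the checklist behind it, but running it as the last step of every build catches the "I forgot that one setting" problems before the server goes into production.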
Having these standards makes supporting the machines much easier, as all of the machines look the same. Troubleshooting a standard build is much easier when you know what filesystems to expect, how the volume groups are set up, where the logs should be, what /etc/tunables/nextboot looks like, and so on. Building servers becomes very easy, especially with the help of a golden image. I think it's essential to have infrastructure hardware you can use to test your standard image. This can be a dedicated machine or an LPAR on a frame; in either case, you set up your standard image to look exactly the way you want all of your new servers to look, and then make a mksysb of it. Use that on your NIM server to do your standard loads. Instead of building from CD, or doing a partial NIM load with manual tasks to finish after the load, keep your golden image up to date and use that instead. Keep the manual tasks that need to happen after the server build to an absolute minimum, which will keep the inconsistencies to a minimum as well. When patches come out, or new tools need to be added to your toolbox, make sure that, besides making those changes to the production machines, you also update your golden image and create a more current mksysb.
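The resource names, paths and SPOT below are placeholders for your own environment, but the general flow looks something like this: take a fresh mksysb on the golden-image LPAR, define it as a NIM resource on the master, and point new installs at it.

    # On the golden-image LPAR: take a fresh mksysb (-i regenerates image.data)
    mksysb -i /export/images/golden_image.mksysb

    # Copy or NFS the image to the NIM master, then define it as a mksysb resource
    nim -o define -t mksysb -a server=master \
        -a location=/export/images/golden_image.mksysb golden_mksysb

    # Install a new client LPAR from the golden image
    nim -o bos_inst -a source=mksysb -a mksysb=golden_mksysb \
        -a spot=spot_53 -a accept_licenses=yes newlpar01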
In next month’s article, I’ll further explore the benefits of establishing good server build standards and checklists.