Edit: I also like HMCscanner output
Originally posted October 2, 2012 on AIXchange
Once I was called in to help a customer that had lost its AIX support staff. I won’t go into the details; just understand that in this case, quite a bit of knowledge vanished overnight and had to be re-created.
We had to figure out passwords and LPAR configurations. Multiple profiles were associated with each LPAR, and there was no one who could answer our questions. The only way to determine how the profiles were created, or whether they were even still active, was to go into each one and look for recent updates. From there, we were left to make educated guesses.
We had to figure out how to connect to the HMC, both locally and remotely. We had to verify network addresses for the HMC as well as the various LPARs. We had to find the user IDs on the various systems that had escalated authority.
Physical connectivity was another puzzle. We found that there were two HMCs in a rack, but only one monitor and keyboard. It turns out the customer was using a KVM in the environment and employed a non-standard way of switching between the different sessions.
We had to figure out how to connect to the storage array, and then determine how the storage was allocated to the servers.
Luckily for us, no lasting damage was done: we were able to recover the passwords and get into the systems. Of course, it did take some time and effort. We didn’t have the luxury of being able to check a runbook, wiki or some other document.
When you build and maintain your own systems, you “just know” all of this information. When you’re a consultant like me and you come into an environment cold, there’s generally someone who can give you this information. Not that it was available in this case, but even documentation can be tricky. Don’t get me wrong, documentation is very valuable — provided it’s current. But outdated documentation is practically worthless, if not actually harmful. It can lead to bad assumptions, which can lead to bad actions, which generally result in system outages.
One tool I rely on in situations like this is HMC sysplans, which provide a snapshot of a machine’s configuration. I also run scripts on all machines so I have current output. That’s the best way to identify what should be on a machine (or at least, what was on a machine).
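As a rough illustration of the "run scripts on all machines" idea, here is a minimal Python sketch that runs a set of commands and saves each one's output under a timestamped directory. The function name `collect_snapshot`, the output location and the command list are all hypothetical choices for this example. The commands shown are common on AIX, but substitute whatever matters in your environment.

```python
import subprocess
import time
from pathlib import Path

# Hypothetical command list: label -> argv. These are commonly available on
# AIX, but treat the selection as an assumption and adjust it for your site.
DEFAULT_COMMANDS = {
    "oslevel": ["oslevel", "-s"],
    "prtconf": ["prtconf"],
    "lsvg": ["lsvg"],
    "df": ["df", "-k"],
    "ifconfig": ["ifconfig", "-a"],
}


def collect_snapshot(commands, out_root):
    """Run each command and save its output in a timestamped directory.

    Returns the Path of the directory that was created.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    out_dir = Path(out_root) / stamp
    out_dir.mkdir(parents=True, exist_ok=True)

    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=120
            )
            text = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the whole snapshot.
            text = f"could not run {argv}: {exc}\n"
        (out_dir / f"{name}.txt").write_text(text)

    return out_dir


# Example call (the path is an assumption):
# collect_snapshot(DEFAULT_COMMANDS, "/var/tmp/config-snapshots")
```

Run on a schedule, something like this leaves a dated trail of what each machine looked like, which is exactly what we were missing in this case.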
Ultimately, though, this is yet another example of why maintaining current documentation is so vital. What kind of critical business information exists only in the memories of your IT staffers? If you were hit by a train, what would be lost?
So what do you do to document the needs and inner workings of your environment? How frequently do you update this information?