Edit: Still good stuff.
Originally posted September 2006 by IBM Systems Magazine
I have always liked the saying that “a lazy computer operator is a good computer operator.” Good operators are constantly looking for ways to practically automate themselves out of a job. For them, the reasoning goes: “Why should we do things manually if the machine can do them for us?”
A few hours spent writing a script or tool can pay for itself very quickly by freeing up the operator’s time for other tasks. Set up the script and its crontab entry correctly, and the machine remembers to take care of the mundane work while you focus on more important things; nobody forgets to run an important report or job again. Sadly, even the best operator with the most amazing scripts and training will need help sometimes, and that is when the page goes out.
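As a trivial example of that kind of automation, here is the sort of crontab entry I have in mind. The script name, schedule, and address are made up for illustration; the point is simply that once the entry is in place, nobody has to remember to run the report by hand.

    # Run the (hypothetical) nightly usage report at 2:00 a.m. every day and
    # mail whatever it prints to the operations team. Installed with "crontab -e".
    0 2 * * * /usr/local/bin/nightly_report.sh 2>&1 | mail -s "nightly usage report" ops@example.com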
In our jobs as system administrators, we know we’re going to get called out during off hours to work on issues. File systems fill up, other support teams forget their passwords or lock themselves out of their accounts at 2 a.m., hardware breaks, applications crash. As much as we would love to see a lights-out data center where machines take care of themselves and no human ever touches them, the reality is that someone needs to be able to fix things when they go wrong.
We hate the late-night calls, but we cope with them as best we can. Hopefully management appreciates that many of us have families and lives outside of work. We are not machines, or part of the data center, and we can’t be expected to function all day at work and then all night after being called out. It’s difficult to get back to sleep after a call, it hurts our performance on the job the next day, and, worse, it can ruin our weekends. However, our expertise and knowledge are required to keep the business running smoothly with a minimum of outages, and that is all factored into our salaries.
I have seen different methods used for scheduling the on-call rotation, but they all boil down to the same thing. Each person on the team gets assigned a week at a time, with some jockeying to schedule on-call weeks around holidays, and usually people can work it out at the team level. In one case I even saw cash change hands so that one individual could skip his week. Whatever method is used, the next question is how long you’re on call. Is it 5 p.m. – 8 a.m. Monday through Friday, plus all day Saturday and Sunday? Is it 24 x 7, Monday through Monday? Does the pager or cell phone get handed off on a Wednesday? Do we use individual cell phones or a team cell phone? These are all variations on the same question, and you have to find the balance that fits the number of calls you handle off-shift and the on-call workload during the day.
On-call rotation is the bane of our existence, but we can take steps to reduce the frequency of the late-night wake-up calls. Stable machines with good monitoring tools and scripts in place go a long way toward eliminating unnecessary callouts, and a well-trained first-level support, operations, or help desk staff can cut them down further.
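Even a tiny script run from cron can head off a callout. The sketch below checks filesystem usage and mails operations before a space problem turns into a page; the threshold, filesystem list, and address are placeholders, and a real monitoring tool does this job far better.

    #!/bin/sh
    # Minimal filesystem-space check -- a stand-in for a real monitoring tool.
    # Mails operations when a monitored filesystem crosses the threshold.
    # The filesystem list, threshold, and address are placeholders.
    THRESHOLD=90
    MAILTO="ops@example.com"

    for fs in / /var /tmp /home
    do
        # Column 5 of "df -P" is the Use% figure; strip the % sign.
        pct=`df -P "$fs" | awk 'NR==2 { sub("%", "", $5); print $5 }'`
        if [ "$pct" -ge "$THRESHOLD" ]; then
            echo "`hostname`: $fs is ${pct}% full (threshold ${THRESHOLD}%)" |
                mail -s "filesystem space warning: $fs" "$MAILTO"
        fi
    done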
In a perfect world, a monitoring tool like NetView, OpenView, or Netcool is in place watching the servers, its configurations are up to date, and all of the critical processes and filesystems are being monitored. When something goes bad, operations sees the alert, and they have good documentation, procedures, and training in place to do some troubleshooting. Hopefully they have been on the job for a while, know what normal looks like in this environment, and can quickly identify when there is a problem. For routine problems, you have given them the necessary authority (via sudo) or written scripts for them to use to reset a password, reset a user’s failed login count, or even add space to a filesystem if necessary.
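To make that concrete, a sudoers fragment along these lines is one way to hand out just enough authority. The group name and the wrapper-script path are placeholders I made up, and the AIX commands shown (pwdadm to reset a password, chsec to clear a failed-login count) would be swapped for whatever the platform in question uses.

    # Fragment of /etc/sudoers (edit with visudo) -- illustrative only.
    # "opergrp" is a placeholder group for the operations staff.
    # Let operators reset a password and clear a failed-login count:
    %opergrp ALL = /usr/bin/pwdadm
    %opergrp ALL = /usr/bin/chsec -f /etc/security/lastlog *
    # Let operators add filesystem space through a wrapper script that
    # enforces limits, rather than handing out chfs itself:
    %opergrp ALL = /usr/local/bin/grow_fs.sh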
I spent time in operations early in my career and learned a great deal from the opportunity. It was a great stepping stone: many of my coworkers got their start working 2nd and 3rd shift in operations positions. It was also a great training ground, which meant the good operators were quickly “stolen” to come work in 2nd- and 3rd-level support areas.
If another support team needs to get involved, operations pages them and manages the call. Then the inevitable happens: someone needs to run something as root, or they need our help looking at topas or nmon output. Hopefully they were granted sudo access to start and stop their applications, but sometimes things just are not working right, and that’s when they page the system administrator. Ideally, by the time we’ve been paged, first-level support has done a good job with initial problem determination and the correct support team has been engaged, so they know exactly what they need from us, the call is quick, and we can go back to sleep.
Sometimes it’s not a quick call: nobody knows what’s wrong, and they’re looking to us to help determine whether anything is wrong with the base operating system. In a previous job, I used a tool that kept a baseline snapshot of what the system should look like when healthy. It recorded which filesystems should be mounted, what the network looked like, and which applications were running, and saved that information to a file. Run against the system in its abnormal state, it made it easy to see what was not running, which made finding the problem very simple. Sometimes even that did not turn anything up, which is where having a good record of all the calls worked by the on-call team is a godsend. A quick search for the hostname brings back hits that can give a clue to problems others on your team have encountered, and what they did to solve them.
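That baseline tool is long gone, but the idea is easy to sketch in a few lines of shell. The snapshot directory and the commands captured below are just examples; the value is in diffing a known-good snapshot against the current state of the machine.

    #!/bin/sh
    # Rough sketch of the baseline idea: "baseline" saves known-good output,
    # "check" diffs the current state against it. The snapshot directory and
    # the commands captured are illustrative, not a full implementation.
    DIR=/var/adm/baseline

    snapshot() {
        df -P                 > "$1/filesystems"
        netstat -rn           > "$1/routes"
        ps -eo comm | sort -u > "$1/processes"
    }

    case "$1" in
    baseline)
        mkdir -p "$DIR"
        snapshot "$DIR"
        ;;
    check)
        NOW=/tmp/baseline.$$
        mkdir -p "$NOW"
        snapshot "$NOW"
        for f in filesystems routes processes
        do
            echo "=== $f ==="
            diff "$DIR/$f" "$NOW/$f"   # anything printed here has changed
        done
        rm -rf "$NOW"
        ;;
    *)
        echo "usage: $0 baseline|check" >&2
        exit 1
        ;;
    esac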
At some point the problem will be solved, everyone will agree the system is running fine, and everyone will hang up from the phone call (or the instant-messaging chat, depending on the situation) and go to bed. Hopefully, while the call was ongoing, you were keeping good notes and updating your on-call database with the information that will help others solve the problem in the future. Just typing “fixed it” into the on-call record will not help the next person who gets called on this issue nine months down the road.
Hopefully you are having team meetings, and in those meetings you are going over the problems your team faced during the last week of being on call and the solutions you used to resolve them. There should be some discussion of whether that is the best way to solve the problem in the future, and whether any follow-up needs to happen. Do you need to write some tools, expand a filesystem, or educate some users or operations staff? Perhaps you need to grant more sudo access so people can do their jobs without bothering the system admin team.
Over time, the process can become so ingrained that the calls decrease to a very manageable level. Everyone will be happier, and the users will have machines that don’t go down; when they do, operations is the first team to know about it. The machines can be proactively managed, which will save the company from unnecessary downtime.