Edit: I should still be better prepared for my rides in the desert
Originally published by TechChannel May 18, 2021
Rob McNelly on why adequate system maintenance can help you prepare for (and avoid) system problems
As I found myself walking through the desert, pushing a bicycle with a flat tire, I wondered how I got here.
That’s not a metaphor; that’s life in Arizona. With the cactus and tumbleweeds and other assorted spiky, poky things, it’s a dangerous place for tires and tubes. It’s not a great place for people, either, once the temperatures hit triple digits, as was the case that day.
I probably could have called someone to pick me up, but that would just add insult to injury. Besides, I didn’t have far to walk, and I had enough water on hand. So I trudged home in the heat, wondering how I could have avoided this fate.
For starters, I could have filled my tire tube with slime. That’s a real thing. It seals any holes in tire tubes. I wouldn’t have needed to do anything else.
Or I could have brought along a spare bike tube, pump, and patch kit. Then I could make repairs on the spot. Or I could have simply replaced the tire itself. It was old and the tread was thinning.
Ultimately I realized this was on me. This outcome was entirely foreseeable, and I’d neglected to adequately prepare for it.
Does that sound familiar? Isn’t this often the case with your system maintenance? Once you determine the cause of the problem, it’s glaringly obvious that something was neglected along the way. Say you open a PMR and IBM informs you that your issue is a known defect. Had you patched your system when that SP or TL was first released, months or even years prior, the bug would not have affected you. Or maybe you need a physical frame taken down so a CE can replace a part. Many components are “hot swappable” these days, but not everything. Wouldn’t it have been nice if you’d prepared for this eventuality by simply keeping a spare frame with free resources available in your environment? That way you could LPM the workload to it, and the necessary work wouldn’t affect anyone at all.
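If you’re not sure how far behind you’ve fallen, the system will tell you. Here’s a minimal check, assuming an AIX LPAR with SUMA configured to reach IBM’s fix servers (the level shown in the comment is just an example, not your level):

    oslevel -s                                  # current TL and SP, e.g. 7200-05-03-2136
    instfix -i | grep ML                        # shows which maintenance/technology levels are fully installed
    suma -x -a Action=Preview -a RqType=Latest  # previews what the latest fixes would pull down

Comparing that output against what IBM currently ships makes the conversation about patching much more concrete.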
But rather than plan, you hoped for the best. Or perhaps you concluded that if it ain’t broke, don’t fix it. And the outcome was entirely foreseeable—as well as entirely avoidable.
Make the Case for Maintenance
Back in the 1980s, Castrol Motor Oil reminded TV viewers that “if you make things hard on your engine, your engine will make things hard on you.” (While a surprising number of old Castrol ads are archived on YouTube, I couldn’t find that particular one. But this is similar.) Like your car, your system is a valuable and complex piece of machinery that requires care. Fall too far behind on patching and basic maintenance, and the simple becomes much more complicated than it should be. Updates become large upgrade projects. Ignoring maintenance will ultimately leave you running old, unsupported hardware and out-of-date OSes, with no easy path forward. Technical debt will be paid, one way or another.
What should be done? Start by letting those at the C-level know that patching is important, and a potentially huge cost savings compared to inaction. Maybe use the car analogy in a gentle reminder. If you can see the need to change your oil or fill your tank (or charge your battery in the case of electric vehicles), you should be able to see the need for system maintenance. You know you can’t ignore that oil change reminder sticker on your windshield. Sure, you can put it off, but not forever. The same applies to your machines.
We may need to convince others of the importance of maintenance, but we admins should know from bitter experience why it matters. It’s a sinking feeling when you realize how easy it would have been to verify that your backups worked before the need arose to actually restore your machine. What about that mksysb you’re taking? Have you audited it to make sure the images are not only being created, but are actually usable? Sure, by writing a script you can theoretically set it and forget it in cron, but neglect to check the results of that backup script and you’ll end up with Schrödinger’s backup:
“The condition of any backup is unknown until a restore is attempted.”
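A small bit of automation goes a long way here. This is just a sketch, assuming a ksh script called from cron and a local backup filesystem; the schedule, paths and notification address are placeholders for whatever your environment actually uses:

    # Hypothetical cron entry: weekly mksysb, output kept for review
    0 2 * * 0 /usr/local/bin/run_mksysb.sh > /var/log/mksysb.log 2>&1

    #!/bin/ksh
    # run_mksysb.sh -- create a rootvg backup and sanity-check the image
    TARGET=/backup/$(hostname).mksysb

    # -i regenerates image.data first; -e honors /etc/exclude.rootvg
    if mksysb -i -e "$TARGET"; then
        # Read the backup header back and list the volume group info it contains
        lsmksysb -lf "$TARGET"
    else
        echo "mksysb failed on $(hostname)" | mail -s "mksysb FAILED" admin@example.com
    fi

Listing the image isn’t the same as restoring it, of course; the only real proof is a periodic test restore to a spare LPAR or NIM client.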
Certainly your VIO mappings and VLANs and other configurations are saved somewhere, right? And your VIO servers are backed up as well, so they can be easily recreated if needed? What about your HMC? Is it easily recoverable? Are the configurations backed up? Did you run and keep hmcscanner reports so you know how everything was set up in your environment? Do you have the information so that shared Ethernet adapters and etherchannel devices can be recreated if needed? Have you actually done this? What about mapping your NPIV and vSCSI disks? And do you know which disk drivers you need to load?
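None of this information is hard to collect ahead of time. As a rough sketch, run as padmin on each VIO server (the output file names are just placeholders):

    # Back up the virtual device configuration (SEA, EtherChannel, vSCSI, NPIV)
    viosbr -backup -file vios1_config

    # Capture the current mappings in human-readable form as well
    lsmap -all       > vscsi_mappings.txt    # vSCSI disk mappings
    lsmap -all -npiv > npiv_mappings.txt     # NPIV (virtual Fibre Channel) mappings
    lsmap -all -net  > sea_mappings.txt      # shared Ethernet adapter mappings

On the HMC side, bkconsdata backs up the console data, and the hmcscanner reports mentioned above give you a point-in-time picture of partitions, profiles and adapter assignments. Keep copies of all of it somewhere other than the machines you’d be trying to recover.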
There’s even more to consider, but you get the point. It all boils down to being prepared for the unexpected. Any and all of these problems can leave you on a slow, humbling walk through the desert, asking yourself why you weren’t better prepared and hoping you have enough water to see you through.
And that is a metaphor.