More Terrifying Tales of IT

Edit: We still see stories like these today whenever ransomware takes out critical systems.

Originally posted March 31, 2015 on AIXchange

I enjoy reading IT-related horror stories, especially those that hit close to home. For me, the best thing about these stories is figuring out what went wrong and then incorporating those lessons into my own environments. Here are a couple of good reads that I want to share.

First, from Network World:

        “Our response to the outage was professional, but ad-hoc, and the minutes trying to resolve the problem slipped into hours. We didn’t have a plan for responding to this type of incident, and, as luck would have it, our one and only network guru was away on leave. In the end, we needed vendor experts to identify the cause and recover the situation.
        Risk 1: The greater the complexity of failover, the greater the risk of failure.
        Remedy 1: Make the network no more complex than it needs to be.
        Risk 2: The greater the reliability, the greater the risk of not having operational procedures in place to respond to a crisis.
        Remedy 2: Plan, document and test.
        Risk 3: The greater the reliability, the greater the risk of not having people that can fix a problem.
        Remedy 3: Get the right people in-house or outsource it.”

I’ve always said that having a test system is invaluable, but simply having the system available to you isn’t enough. You must also make the time to use it, play with it, blow it up. And you absolutely cannot allow your test box to slowly morph into a production server.

This Computerworld article tells an even scarier tale of a hospital that was forced to revert entirely to paper when its network crashed. Though this incident occurred back in 2002, I believe it’s still relevant reading. Technology today is more reliable than ever, but troubleshooting is a skill we’ll always need.

        “Over four days, Halamka’s network crashed repeatedly, forcing the hospital to revert to the paper patient-records system that it had abandoned years ago. Lab reports that doctors normally had in hand within 45 minutes took as long as five hours to process. The emergency department diverted traffic for hours during the course of two days. Ultimately, the hospital’s network would have to be completely overhauled.
        First, the CAP team wanted an instant network audit to locate CareGroup’s spanning tree loop. The team needed to examine 25,000 ports on the network. Normally, this is done by querying the ports. But the network was so listless, queries wouldn’t go through.
        As a workaround, they decided to dial in to the core switches by modem. All hands went searching for modems, and they found some old US Robotics 28.8Kbps models buried in a closet. Like musty yearbooks pulled from an attic, they blew the dust off them. They ran them to the core switches around Boston’s Longwood medical area and plugged them in. CAP was in business.
        In time, the chaos gave way to a loosely defined routine, which was slower than normal and far more harried. The pre-IT generation, Sands says, adapted quickly. For the IT generation, himself included, it was an unnerving transition. He was reminded of a short story by the Victorian author E.M. Forster, “The Machine Stops,” about a world that depends upon an uber-computer to sustain human life. Eventually, those who designed the computer die and no one is left who knows how it works.
        He found himself dealing with logistics that had never occurred to him: Where do we get beds for a 100-person crisis team? How do we feed everyone?

        Lesson 1: Treat the network as a utility at your own peril.
        Actions taken:
        1. Retire legacy network gear faster and create overall life cycle management for networking gear.
        2. Demand review and testing of network changes before implementing.
        3. Document all changes, including keeping up-to-date physical and logical network diagrams.
        4. Make network changes only between 2 a.m. and 5 a.m. on weekends.

        Lesson 2: A disaster plan never addresses all the details of a disaster.
        Actions taken:
        1. Plan team logistics such as eating and sleeping arrangements as well as shift assignments.
        2. Communicate realistically—even well-intentioned optimism can lead to frustration in a crisis.
        3. Prepare baseline, “if all else fails” backup, such as modems to query a network and a paper plan.
        4. Focus disaster plans on the network, not just on the integrity of data.”
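The detail about needing to examine 25,000 ports stuck with me, because a spanning tree loop is exactly the kind of problem where you want a scripted sweep instead of logging into switches one at a time. Below is a minimal sketch of that idea in Python: it shells out to net-snmp’s snmpwalk to pull the standard BRIDGE-MIB spanning tree port states from a list of switches and flags anything that isn’t cleanly forwarding or blocking. The hostnames and community string are placeholders, and it assumes your switches answer SNMP v2c and expose BRIDGE-MIB, so treat it as a starting point rather than a finished tool.

        #!/usr/bin/env python3
        # Sketch: sweep a list of switches for spanning tree port states via
        # net-snmp's snmpwalk. Hostnames and community string are placeholders.

        import re
        import subprocess

        # BRIDGE-MIB dot1dStpPortState:
        # 1=disabled, 2=blocking, 3=listening, 4=learning, 5=forwarding, 6=broken
        STP_PORT_STATE_OID = "1.3.6.1.2.1.17.2.15.1.3"

        SWITCHES = ["switch-a.example.com", "switch-b.example.com"]  # hypothetical hosts
        COMMUNITY = "public"  # read-only community string (assumption)

        def walk_stp_states(host):
            """Return raw snmpwalk output for the STP port state table, or None."""
            try:
                result = subprocess.run(
                    ["snmpwalk", "-v2c", "-c", COMMUNITY, host, STP_PORT_STATE_OID],
                    capture_output=True, text=True, timeout=30,
                )
            except (OSError, subprocess.TimeoutExpired) as exc:
                print(f"{host}: query failed ({exc})")
                return None
            if result.returncode != 0:
                print(f"{host}: snmpwalk error: {result.stderr.strip()}")
                return None
            return result.stdout

        if __name__ == "__main__":
            for switch in SWITCHES:
                output = walk_stp_states(switch)
                if output is None:
                    continue
                for line in output.splitlines():
                    # Pull the numeric state whether output reads "INTEGER: 5"
                    # or "INTEGER: forwarding(5)".
                    match = re.search(r"INTEGER:\s+\D*(\d+)", line)
                    if match and match.group(1) in ("2", "5"):
                        continue  # blocking or forwarding: normal steady state
                    print(f"{switch}: {line}")

Of course, none of this helps once the network is so saturated that SNMP queries won’t go through, which is exactly the corner the CAP team found itself in. That’s the argument for keeping an out-of-band path, even if it’s a dusty 28.8Kbps modem pulled from a closet.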

Anyone who’s spent even a few years in our profession has at least one good horror story. What’s yours? Please share it in the comments.