Data Backup Options Balance Risk and Cost

Edit: A backup without a test restore is a wish.

Originally posted February 2019 by IBM Systems Magazine

In some environments, disaster recovery (DR) testing and system rebuilding are ongoing. The most dedicated organizations conduct failover tests and run Live Partition Mobility (LPM) operations to evacuate frames so maintenance can be safely performed. Then LPM is used to put LPARs back onto the frames when the maintenance is complete.

Other environments are much more static. LPARs are built, quarterly or semi-annual patches are applied and that’s it. Of course, far too many environments do no maintenance at all. While regular testing is ideal, this level of activity isn’t practical or even necessary for everyone. Before you invest in system availability, understand that you’ll always face some level of risk.

We’ve all been in meetings where recovery time and recovery point objectives are set. How far back should your backups go? That depends: How much data can you afford to lose? You must determine the amount of risk that’s acceptable to your enterprise.

Of course, these decisions are often based on business priorities rather than technical considerations. For instance, in some enterprises where real-time or near real-time data replication is seen as cost-prohibitive, backup tapes are shipped to a DR location and then restored to a secondary machine. This provides an extra layer of protection, but it’s also an example of balancing cost versus risk. In this case, the risk is the data created since the last shipped tape, which would be lost if you ever had to recover from it.

Greatest Investment, Lowest Risk

Not too long ago, maintaining two data centers was seen as an option strictly for huge organizations with large IT budgets, but this practice is relatively mainstream now. Obviously, the benefit of protecting data with a secondary data center is that a disaster or any sort of outage is unlikely to take out both facilities simultaneously. However, these solutions still carry risks. For instance, data corruption is still a possibility. If data is maliciously encrypted or destroyed, it may still be copied to your secondary location. For this reason, offline backups should still be a component of your solution.

Also, keep in mind the importance of testing. If you have a high availability (HA) cluster, fail it over regularly and run production for a period of time on your secondary node. (This assumes your failover node is sized to handle the entire workload rather than only its most critical components.) The same goes for a DR site: fail over to it and run production from the secondary location. Verify that everything works as it should. Don’t wait for an unplanned outage or an actual disaster to learn that a critical piece of infrastructure or data didn’t get replicated. Testing may also reveal technical issues with DNS or network connectivity into the secondary data center, or procedural issues that should be ironed out when personnel are fresh and expecting to troubleshoot issues.
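
If you run PowerHA SystemMirror, a planned failover exercise can be as simple as the sketch below; the resource group and node names are placeholders, and the exact clmgr syntax should be confirmed against your cluster level before you rely on it.

    # Planned failover exercise (PowerHA SystemMirror assumed; names are placeholders)
    clmgr query cluster                              # confirm the cluster is stable before starting
    clmgr move resource_group prod_rg node=nodeB     # move production to the secondary node
    clmgr query resource_group prod_rg               # verify the group came online on nodeB
    # ...run real workload on nodeB long enough to trust it...
    clmgr move resource_group prod_rg node=nodeA     # move back at the end of the test window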

Medium Investment, Reduced Risk

If your OS configuration is fairly static, monthly or weekly OS backups may be sufficient. Again though, you must understand the risks. Obviously, in the event of a restore, you’ll need to reintroduce any changes that occurred since the last OS backup. Beyond that, something could happen to your backup image.
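
On AIX, that usually means a scheduled mksysb of rootvg. Here’s a minimal sketch, assuming a local or NFS-mounted /mksysb filesystem; the path, schedule and retention period are placeholders you’d adjust for your environment.

    #!/usr/bin/ksh
    # Weekly rootvg backup sketch (target path and retention are assumptions)
    BACKUPDIR=/mksysb
    HOST=$(hostname)
    STAMP=$(date +%Y%m%d)
    # -i regenerates image.data, -e honors /etc/exclude.rootvg, -X expands /tmp if needed
    /usr/bin/mksysb -i -e -X ${BACKUPDIR}/${HOST}.${STAMP}.mksysb
    # prune images older than five weeks so the filesystem doesn't fill
    find ${BACKUPDIR} -name "${HOST}.*.mksysb" -mtime +35 -exec rm {} \;

Run it from cron or the scheduler of your choice, and remember that mksysb covers rootvg only; data volume groups need their own savevg or backup-tool coverage.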

If OS and data backups are written to tape, make sure the tapes are clearly labeled and securely stored offsite, and that you have a recovery plan and a method to quickly access them if needed. Remember: A severe outage might lead to a complete loss of access to your machines and data center. Also, remember that tapes don’t last forever. Have a plan in place to replace them over time.
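
A quick read-back after the backup completes is a cheap way to catch unreadable media before a tape ships offsite. A sketch, assuming a mksysb written to /dev/rmt0 (the backup data sits in the fourth file on a mksysb tape, hence the skip):

    # Write a bootable rootvg backup to tape, then confirm it can be read back
    mksysb -i /dev/rmt0
    tctl -f /dev/rmt0 rewind
    # list a sample of the archive from the fourth tape file,
    # using the no-rewind device so the tape position is preserved
    restore -s4 -Tqvf /dev/rmt0.1 | head -20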

On the other end of the spectrum are organizations that rely on their storage subsystem to take snapshots. This may seem like sufficient protection, but storage subsystems do fail. Or what if some sort of catastrophe makes the snapshots unreadable? OS images and snapshots can’t be recovered if they no longer exist in a readable form.

Again, testing is critical. You’ll never know your backups are good unless you try to restore them. And when was the last time you audited your backups? Changes happen. Are you sure the backups you set up are still running properly? Even restoring individual files periodically can help you confirm that the backups can be read and the data still exists.
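
For disk-based mksysb images, a periodic spot check might look like this sketch; the image name and the file being pulled back are placeholders.

    # Spot-check a mksysb image on disk (paths are placeholders)
    lsmksysb -lf /mksysb/myhost.20190203.mksysb           # volume group info stored in the image
    restore -Tqvf /mksysb/myhost.20190203.mksysb | head   # sample the archive's table of contents
    mkdir -p /tmp/restoretest && cd /tmp/restoretest
    restore -xqvf /mksysb/myhost.20190203.mksysb ./etc/hosts   # pull back a single file
    diff ./etc/hosts /etc/hosts                           # compare the restored copy to the live file

If the listing fails or the restored file doesn’t match expectations, you’ve learned something important at a convenient time rather than during an outage.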

Backups shouldn’t be limited to enterprise systems, either. VIO servers and the HMC should also be backed up and maintained. Make sure boot media and any other necessary tools are readily available should you need to rebuild machines after a disaster.
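
As a sketch, assuming the standard VIOS commands run from the padmin restricted shell (file names are placeholders):

    # On each VIO server, as padmin:
    backupios -file /home/padmin/vios_backup.mksysb -mksysb   # mksysb-style image of the VIOS itself
    viosbr -backup -file vios_devcfg                          # virtual and logical device configuration
    # On the HMC, schedule the Backup Management Console Data task
    # (or the bkconsdata command) to save console data to remote media.

Keep copies of these somewhere other than the frame they protect.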

Risking It All

As I noted at the beginning, in some enterprises the choice is to do nothing. Backups may not occur at all, or they’re rarely tested. Legacy systems may be left to run without being maintained in any way.

Again, risks and costs are being weighed, but in these cases, the risk may be misunderstood, or seen as negligible, while any cost is viewed as onerous. I won’t offer a lengthy defense of IT spending because if you’re reading this, it’s highly likely that you fully understand the need to protect data and the systems that store it. Plus, that’s probably in your job description.

Whatever choices are made, whatever is invested and whatever risk is allowed, it’s critical that your backup and recovery process is thoroughly documented and that everyone in the organization understands the ramifications of these decisions. If you have concerns about, say, recovering your LPARs, make them known immediately, before an event occurs.

Certainly, additional backup solutions and options are available—I didn’t discuss virtual tape, for example—but hopefully some of these points will help spark an honest assessment of your current situation.