Without a Clear Recovery Plan, Backup and Replication Is Not Enough
The mainstream adoption of the public cloud has made it possible for organizations of almost any size to acquire true disaster recovery (DR) capabilities for critical workloads. Many solutions are now available for failing workloads over to the cloud. Even so, the failover process can be surprisingly complex, and a failover is unlikely to succeed unless the organization has taken the time to develop a comprehensive recovery plan.
On the surface, DR seems straightforward. Virtual machines (VMs) are replicated to the cloud on an ongoing basis. If and when disaster strikes, the cloud-based replica VMs take over for the VMs that have failed. As is the case with so many other things in IT, however, the devil is in the details. Failing over to a cloud-based VM replica is rarely as simple as just powering up the replica VM.
One issue that must be considered is the web of dependencies that exist between workloads. It’s becoming increasingly rare for workloads to be confined to a single VM. More often, workloads are multi-tier and span numerous VMs. In addition, a workload may also have infrastructure-level dependencies. Consider, for example, the dependencies that may exist for a simple web application.
Such an application would likely have a dependency on a back-end database, residing on a separate VM. That database might in turn have a dependency on Active Directory. Active Directory, in turn, has a dependency on DNS. As such, simply failing over the web application server wouldn’t result in the web application running in the cloud: you would also have to fail over all of the dependency components.
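The dependency chain described above can be expressed as a small graph and sorted so that every workload starts only after the things it depends on. The following is a minimal sketch, assuming Python 3.9 or later; the workload names and the DEPENDENCIES map are hypothetical placeholders, not the output of any particular DR product.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each workload lists what it depends on.
DEPENDENCIES = {
    "web-app": {"database"},
    "database": {"active-directory"},
    "active-directory": {"dns"},
    "dns": set(),
}


def required_workloads(target, dependencies):
    """Collect the target workload plus everything it depends on, directly or indirectly."""
    needed = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(dependencies.get(node, ()))
    return needed


def failover_order(target, dependencies):
    """Return the workloads needed by target, ordered so dependencies come first."""
    needed = required_workloads(target, dependencies)
    subgraph = {name: set(dependencies.get(name, ())) for name in needed}
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(failover_order("web-app", DEPENDENCIES))
```

For the web application in this example, the sketch starts DNS first and the web server last. A real recovery plan would also need health checks between steps, which this sketch leaves out.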
Another often overlooked consideration is the impact that the failover process can have on IP addresses. If you fail a VM over to the cloud, the replica VM is running in a different subnet from the one in which the VM was originally running. If the VM has been assigned a static IP address, the IP address will not be valid within the new subnet. If the VM is configured to use a dynamically assigned IP address, it will receive an address that’s appropriate for the subnet (assuming that a cloud-based DHCP server is in place). In either case, the cloud-based VM may be cut off from the rest of your network, because DNS records, firewall rules, and other systems may still point to the old address.
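The static-address problem is easy to check ahead of time. The sketch below uses Python's standard ipaddress module to test whether a VM's address belongs to a given subnet; the subnet ranges and the VM address are made-up values for illustration.

```python
import ipaddress


def is_valid_in_subnet(address, subnet):
    """Return True if the address falls inside the subnet (membership check only)."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)


# Hypothetical values: an on-premises subnet, a cloud subnet, and a VM's static IP.
ON_PREM_SUBNET = "192.168.10.0/24"
CLOUD_SUBNET = "10.20.0.0/24"
VM_STATIC_IP = "192.168.10.25"

if __name__ == "__main__":
    for label, subnet in (("on-premises", ON_PREM_SUBNET), ("cloud", CLOUD_SUBNET)):
        status = "valid" if is_valid_in_subnet(VM_STATIC_IP, subnet) else "NOT valid"
        print(f"{VM_STATIC_IP} is {status} in the {label} subnet {subnet}")
```

Running a check like this against every replicated VM before a failover would flag the machines whose static addresses need attention.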
There are several ways to fix this problem. One possible solution is to use orchestration to automatically update IP addresses, DNS records, routing tables, and so on. Another possible solution is to use network virtualization in a way that allows the cloud-based VMs to retain their original IP address configuration. In any case, you’ll need to have a plan in place for dealing with IP addresses during a failover to the cloud.
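As a rough illustration of the orchestration approach, the sketch below maps each VM's static address to the same host offset in the cloud subnet and builds the list of DNS records that would need to change. The hostnames, subnets, and function names are assumptions made for this example, and the code only computes a plan; applying it would depend on the DNS and cloud tooling actually in use. Many cloud providers also reserve a few addresses in each subnet, which this sketch does not account for.

```python
import ipaddress


def remap_address(address, source_subnet, target_subnet):
    """Map an address to the same host offset within the target subnet."""
    source = ipaddress.ip_network(source_subnet)
    target = ipaddress.ip_network(target_subnet)
    ip = ipaddress.ip_address(address)
    if ip not in source:
        raise ValueError(f"{address} is not in {source_subnet}")
    offset = int(ip) - int(source.network_address)
    if offset >= target.num_addresses:
        raise ValueError(f"{target_subnet} is too small to hold offset {offset}")
    return str(target.network_address + offset)


def build_dns_update_plan(vm_addresses, source_subnet, target_subnet):
    """Return {hostname: new_address} for every VM that must be re-addressed."""
    return {
        hostname: remap_address(address, source_subnet, target_subnet)
        for hostname, address in vm_addresses.items()
    }


# Hypothetical inventory of VMs with static addresses.
VM_ADDRESSES = {
    "web01.example.internal": "192.168.10.25",
    "db01.example.internal": "192.168.10.40",
}

if __name__ == "__main__":
    plan = build_dns_update_plan(VM_ADDRESSES, "192.168.10.0/24", "10.20.0.0/24")
    for hostname, new_address in plan.items():
        print(f"update A record {hostname} -> {new_address}")
```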
The point is that even if an organization dutifully backs up its IT workloads, and replicates everything offsite, the DR process can fail as a result of the organization not taking into account the low-level requirements for running workloads in the cloud.
The single most important thing that an organization can do to avoid the problems that so often occur in DR situations is to develop a comprehensive DR plan. This plan should precisely define workload requirements, as well as the process for ensuring that those requirements are met.
Of course, this plan will need to be validated to ensure that it works. The only way to reliably validate an organization’s plan is through DR testing. Even if the initial DR tests are successful, however, DR testing should be treated as an ongoing process. There are two important reasons for this.
First, proactive testing can help the organization spot problems before a recovery operation actually becomes necessary. Changes made to an organization’s IT resources can cause DR operations to fail. Regularly scheduled DR testing is the best defense against such unanticipated failures.
A second reason proactive testing is so important is that it helps the IT staff become more familiar with the recovery process. Remember, the amount of time that it takes to perform a failover could mean the difference between still being in business in a year, or not. Testing can confirm that both the DR infrastructure and the IT staff can perform at a level that will allow a recovery operation to complete within the required time frame.
In short, replicating IT resources offsite is only the first step in the disaster recovery process. It’s just as important to create and regularly test a recovery plan, and to keep that plan up to date.