This post is Part 1 of a five-part series about Disaster Recovery in the Cloud and how Stratum is leading the way with our partners to enable Cloud Computing in our customers’ environments. This post discusses the key concepts of Disaster Recovery and dispels some common myths.
Disaster Recovery. We all know these dreaded words. Even the mention of the words makes an IT Operations Team shiver. Whether on an operations team that builds the runbooks for execution or an architecture team tasked with ensuring applications meet the fabled “Five Nines” of availability, Disaster Recovery scares us all. The very idea of building out an entire disaster recovery infrastructure leaves us in a cold sweat. Add in management of warm-standby servers, data synchronization schedules, and semi-annual (or even the dreaded quarterly!) readiness tests, and any operations team rapidly loses hope.
But disasters don’t need to strike fear into the hearts of IT organizations. Disaster Recovery is actually a good thing. It’s the one thing that IT organizations today can do that is both cost-effective and improves business survivability.
Stratum’s headquarters is in the Houston, TX metropolitan area, and one thing we’ve learned is how to deal with “named storms”, a common disaster scenario along the Gulf and East Coasts of the US. Every named storm is a disaster recovery exercise for Stratum. We begin procedures long before a hurricane or tropical storm ever makes landfall in the area. Once a storm forms that warrants monitoring, we send daily updates to our staff. As the risk escalates, so too do our communication plans and readiness exercises.
It’s planning that matters. It’s a consistent playbook that is clear and concise in its actions. If you’ve lived through a disaster, you know that when a plan depends on someone affected by the disaster, those recovery activities are never successful. Simple things like that may mark the difference between a disaster recovery exercise and simply a disaster.
Disaster Recovery vs. Business Continuity
First things first: Disaster Recovery is NOT Business Continuity. The two terms are often used interchangeably, and that confusion often leads to challenges when operating within a disaster scenario. In organizations where there is no clear Business Continuity Plan (BCP), there is often a lack of clear objectives for the recovery activities; there are no defined success criteria for the execution of the plan. Similarly, the activities take place (either automatically or with some manual intervention) without anyone knowing what happens next. The criteria that determine a successful recovery are often as difficult to understand as the plans the operators are asked to execute.
To alleviate all this confusion, we provide some definitions at Stratum. Business Continuity is the science of ensuring your business can survive a catastrophic event. This event doesn’t have to be an environmental one, although that is often the case. The social upheaval of the Arab Spring, the blackout in the Northeastern United States in 2003, and the widespread outage of the Internet in November 2016 are just some examples of when disaster may strike with no warning. There needs to be a continuity plan in each of these cases to protect against local, regional, and potentially global events that disrupt your organization’s image and livelihood.
Disaster Recovery is one activity within the Business Continuity Plan. So many of the other activities in a Business Continuity Plan have nothing to do with IT that some universities are now offering degrees in it. The focus of this set of posts is going to be primarily on Disaster Recovery.
The Value to Cost Ratio
Before anything else takes place, it’s important to know some things about the system(s) you want to protect and recover. At Stratum we use business characteristics of the system (or the application it supports) to help determine how important it is. Stratum uses two specific characteristics: Value of Application/System, and Cost of Application/System.
The Value of the application is a difficult thing to calculate for some organizations. If you’re valuing an e-commerce application that processes orders Monday through Friday from 8 a.m. to 6 p.m., then it’s pretty easy to calculate (example: take the revenue generated each month and divide it by the number of hours the application runs). There are scenarios where it’s not quite as tangible, but that’s not the focus of this series. The critical point is that you should be able to determine a value to the business for each application or service.
The Cost of the system is simple to calculate. Whether that’s an on-premises physical machine or a cloud-based database platform, the cost is measured as the cost per month to run that infrastructure.
We use a simple ratio, the Value of the application divided by its Cost, to calculate the importance of the system. We refer to the result as the Value Coefficient.
With this simple ratio, we can determine what the impact is on the organization in the event of an outage. We use this number to determine the Recovery Time Objective and the Recovery Point Objective.
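As a rough illustration of the arithmetic, here is a minimal Python sketch. The `Application` class, its field names, and the numbers in the usage note are hypothetical and chosen only for this example. It also assumes that Value and Cost are compared over the same monthly period, which is our reading of the ratio rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    """Hypothetical record of the two business characteristics of a system."""
    name: str
    monthly_revenue: float   # Value: revenue attributed to the application per month
    monthly_cost: float      # Cost: what it costs per month to run the infrastructure
    hours_per_month: float   # hours per month the application is actually running

    def hourly_value(self) -> float:
        # Monthly revenue divided by the number of hours the application runs.
        if self.hours_per_month <= 0:
            raise ValueError("hours_per_month must be positive")
        return self.monthly_revenue / self.hours_per_month

    def value_coefficient(self) -> float:
        # Assumption: Value and Cost are both expressed per month,
        # so the ratio is unitless and comparable across systems.
        if self.monthly_cost <= 0:
            raise ValueError("monthly_cost must be positive")
        return self.monthly_revenue / self.monthly_cost
```

For an illustrative storefront that earns 220,000 per month, costs 20,000 per month to run, and operates 220 hours per month (22 weekdays of 10 hours), `hourly_value()` gives the revenue at risk for each hour of outage and `value_coefficient()` gives the ratio used to rank the system against others.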
The Basics: RPO and RTO
If you have never heard of these two acronyms, please read this next section VERY closely. They will become the two most important acronyms in your daily vocabulary.
The RPO of a system is its Recovery Point Objective. In simple terms, an RPO defines the tolerance for lost data. If each and every transaction must be recoverable, for example in a financial services organization, then the RPO is effectively 0 and requires data duplication and system redundancy. If systems are designed to support weekly or monthly rhythms (such as a payroll system), the RPO may allow some data to be lost while the system is still able to function. Stratum leverages the RPO to help us determine the level of data redundancy required for the system.
RTO is the abbreviation for the Recovery Time Objective. The simple explanation for RTO is the number of minutes, hours, or days that are acceptable for the system to be unavailable. The RTO describes how tolerant your organization is to an outage, using time as the measure. Most organizations ONLY define the RTO as a part of the Service Level Agreement (SLA) for the application. This results in organizations that build for High Availability (HA) without consideration of the important question: what happens if the whole site goes offline? Instead, we leverage RTO to describe the amount of staging and readiness required for recovery operations.
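To make the two objectives concrete, here is a small Python sketch under stated assumptions. The `RecoveryObjectives` class and both helper functions are hypothetical names invented for this example. It assumes the worst-case data loss equals the interval between data copies, and that recovery time is something you measure during a readiness test.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the two objectives of one system."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time the system may be unavailable


def meets_rpo(objectives: RecoveryObjectives, copy_interval: timedelta) -> bool:
    # Assumption: in the worst case, everything written since the last
    # copy is lost, so the copy interval must not exceed the RPO.
    return copy_interval <= objectives.rpo


def meets_rto(objectives: RecoveryObjectives, measured_recovery: timedelta) -> bool:
    # Compare the recovery time observed in a readiness test with the RTO.
    return measured_recovery <= objectives.rto


# Illustrative values only: a payroll system that tolerates a day of lost
# data, and a ledger where every transaction must be recoverable (RPO of 0).
payroll = RecoveryObjectives(rpo=timedelta(hours=24), rto=timedelta(hours=8))
ledger = RecoveryObjectives(rpo=timedelta(0), rto=timedelta(minutes=15))
```

With these illustrative values, a six-hour copy schedule satisfies the payroll RPO, while even a five-minute schedule fails the ledger RPO. That is why an RPO of effectively 0 calls for continuous data duplication rather than scheduled copies.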
Now that we’ve determined we have an application with a Value Coefficient greater than 1.0 and have assigned it an RTO and RPO, we need to determine the Recovery Pattern and the Recovery Practice. In Part 2 of this series, we’re going to look at the Patterns and Practices for Disaster Recovery and how Cloud Providers make building DR architectures simpler.