Recently, I was involved with the implementation of a replication solution at an electric utility company. It was the centerpiece of a disaster recovery (DR) initiative for one of the company's power plants. Normally, when a replication solution is implemented, data recoverability at the remote site is mandatory. After all, what good is the replication solution if it cannot be used to recover data at the remote location?
What I found strange in this particular situation was the lack of any remote recoverability apparatus — i.e., servers, test plans, etc. When I inquired as to why that was the case, the storage architect replied, “It’s just another project that we want to complete by the end of the year — we just want to mark it off as completed.” I was shocked, to say the least. Which brings me to the topic of this article: Do IT departments treat disaster recovery as just another project? Or is DR a way of life for most departments? If it is the latter, do such departments ensure that this way of life is dependable?
As we look back on what the last decade has taught us and launch a new decade that builds upon such learning, it is obvious that the world has become a very uncertain place to live in. You never know what will happen and what circumstances will lead to disruption of normal life. Businesses are no different. What this means is that the bar for disaster recovery has been raised. It cannot be simply treated as just another project that is implemented and forgotten. While the levels of DR for each IT department may vary, the constants are the need for a sound disaster recovery strategy, as well as the ability to achieve remote recovery in the case of failure of the primary site. These should become a way of life for all IT departments.
Disaster recovery means different things to different companies. For some, the line between operational recovery and true disaster recovery is a thin one. For others, business continuity is all about disaster recovery. In defining the true meaning of DR, there is one thing for certain: It is not all about technology. In fact, a well-oiled DR strategy should be a perfect blend of technology, people, processes and protocol.
You need technology because it forms the foundation for getting your most important data from point A to point B. You need people — not just any people, but the right folks who know their tasks well. Processes ensure consistency, and protocol ensures that the appropriate person or team makes critical DR decisions in accordance with the overall business objectives.
DR and Technology
The relation of DR and technology is relatively obvious. Technology provides businesses, regardless of the industry, ubiquitous access to data. Technology also enables businesses to mobilize and protect their data and make it highly available. The proper choice of technology can therefore allow this data to be quickly accessible from an alternate — namely “remote” or “secondary” — location should the primary location become unavailable.
How quickly you want it accessible (i.e., the recovery time objective or RTO) and how much data loss is sustainable (i.e., the recovery point objective or RPO) will decide the amount of investment that needs to be made. Key investment areas:
- Physical infrastructure at the remote location
- Recovery platforms
- Data replication technologies
- Network connectivity
- Last but not least, automation technologies
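To make the two objectives defined before this list concrete, here is a minimal Python sketch of how the results of a DR test could be compared against them. The function names, timestamps and targets are hypothetical and chosen only for illustration; they do not come from any particular product.

```python
from datetime import datetime, timedelta


def achieved_rpo(outage_start: datetime, last_replicated: datetime) -> timedelta:
    """Data loss window: time between the last replicated write and the outage."""
    return outage_start - last_replicated


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the outage and service resuming at the remote site."""
    return service_restored - outage_start


def meets_objectives(rpo: timedelta, rto: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data loss and the downtime are within target."""
    return rpo <= rpo_target and rto <= rto_target


if __name__ == "__main__":
    # Hypothetical DR test: all values below are made up for illustration.
    outage = datetime(2010, 1, 1, 12, 0)
    last_copy = datetime(2010, 1, 1, 11, 45)
    restored = datetime(2010, 1, 1, 15, 30)

    rpo = achieved_rpo(outage, last_copy)
    rto = achieved_rto(outage, restored)
    ok = meets_objectives(rpo, rto,
                          rpo_target=timedelta(minutes=30),
                          rto_target=timedelta(hours=4))
    print(f"RPO achieved: {rpo}, RTO achieved: {rto}, within targets: {ok}")
```

Tightening either target in this sketch is what drives up the investment in the areas listed above.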
The physical infrastructure generally starts with the data center that houses the necessary devices to sustain the business during peak hours. Devices themselves include servers, storage, networking equipment and all the other “stuff” the business may require to function “normally.”
The recovery platforms are server builds complete with all the applications installed and configured. These days, server virtualization has made this task considerably easier. Data replication technologies include storage array-to-array, array-to-appliance, server-to-server and application-based replication.
Many companies choose to deploy a one-size-fits-all solution. Some choose to deploy an a-la-carte solution that incorporates the best-of-breed technologies. Each solution has its pros and cons, but the important thing is to implement a solution that works, is reliable and is cost-effective.
Network connectivity between the production and remote sites is necessary so that data is replicated in a timely manner and with minimal lag. If the network pipe is too small for the data change rate (a measure of how much data must be replicated in a given period), the remote location will never catch up to the production site, leaving the RPO unachievable. Conversely, a network pipe that is far bigger than the change rate requires means you are spending more money on the network than is needed.
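The sizing trade-off can be sketched with some simple arithmetic. The Python below is a rough illustration, not a sizing tool: the function names are hypothetical, the default efficiency factor of 0.7 (to allow for protocol overhead and shared links) is an assumption, and it ignores compression, deduplication and bursts in the change rate.

```python
def required_mbps(changed_gb: float, window_hours: float,
                  efficiency: float = 0.7) -> float:
    """Link speed in megabits/s needed to ship `changed_gb` within `window_hours`.

    `efficiency` is the assumed usable fraction of the nominal link speed.
    """
    if window_hours <= 0 or not 0 < efficiency <= 1:
        raise ValueError("window_hours must be > 0 and efficiency in (0, 1]")
    megabits = changed_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    return megabits / (window_hours * 3600) / efficiency


def backlog_gb(changed_gb_per_hour: float, link_mbps: float, hours: float,
               efficiency: float = 0.7) -> float:
    """Unreplicated data (GB) after `hours`; 0.0 if the link keeps up."""
    shipped_gb_per_hour = link_mbps * efficiency * 3600 / 8000
    return max(0.0, (changed_gb_per_hour - shipped_gb_per_hour) * hours)


if __name__ == "__main__":
    # Hypothetical workload: 90 GB of changed data every hour.
    print(f"Needed: {required_mbps(90, 1):.0f} Mbps")
    print(f"Backlog on a 100 Mbps link after 8 h: {backlog_gb(90, 100, 8):.0f} GB")
```

A backlog that keeps growing is the "never catches up" case described above, while a required speed far below the purchased link speed points to overspending.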
Lastly, automation technologies are necessary to make the job of switching production “traffic” to the remote location seamless and quick. Since time is of the essence during a production outage, such technologies are really a life saver if deployed correctly.
DR and People
Let’s face it, we are not yet at a point where technology has become so intelligent that it can manage itself. No DR strategy can be successful without a proper portfolio of skilled resources. Notice that I say “portfolio.” The reason is that a DR exercise, whether it is an actual event or a test, is a team effort.
In fact, in larger organizations, this often requires several teams all working in concert under the guidance of a conductor, often a person or team that oversees DR activities for that company. It is the task of this body to identify the skills that are necessary to bring about a successful switch from production to DR.
The “deployment” team for DR should not read like the company’s phone directory but rather contain the right blend of technical individuals, program managers and project managers — all of whom will have a specific set of tasks given to them.
I highly support a cross-functional team model. A traditional organizational chart need not be maintained for a DR exercise or deployment. Instead, cross-functional teams ensure that the best folks on the job come together and collaborate to ensure the success of a DR exercise. This team should be in a perpetual standby mode — ready to swing into action when protocol demands it.
DR and Processes
In my opinion, processes are the glue that bonds technology and people together. No discussion of DR is complete without detailed process plans that govern the actions undertaken by the technology and the people driving the strategy. Processes and protocols are so important at a DR site that, no matter how good the technology or how skilled the people, a sloppy or improper deployment can render the entire DR solution a disaster (no pun intended).
So, what are the key goals when creating processes? The single most important goal for DR processes is to ensure that business goals are met. Further, it is important to ensure you meet the RPOs and RTOs dictated by the business.
However, the purpose for processes does not stop there. Processes ensure consistency in the outcome of any DR exercise. They create structure for tasks to be performed in a given manner and in a particular sequence, and they ensure that all individuals and teams perform their tasks in a precise manner.
DR and Protocol
Protocol in DR is like the red nuclear button. It ensures that management (with proper authority, of course) calls the shots, announcing when DR has to be triggered and when business needs require a switch from the primary site to the DR site and back again. The person or team responsible is generally in charge of communications both upstream and downstream and ensures that there are no ambiguities.
Protocol is also important in triaging issues and resolving conflicts, meeting unexpected challenges, and addressing any problems that are outside the established domains of other people, technologies and processes.
DR is not and should not be treated as just another project. Instead, every company should seek to build a robust DR strategy that incorporates the appropriate mix of technology, people, processes and protocol into the way it does business.
A good DR strategy will ensure that the organization has in place the mechanisms that allow it to stay in business at all times, performing as a well-oiled machine that switches gears seamlessly when needed.
Ashish Nadkarni is a principal consultant at GlassHouse Technologies.
An excellent and well-reasoned article. Critical processes need to be made resilient, and it’s time we moved on from accepting that data recovery is all that is required to protect business interests. Disaster Resilient design can avoid end-user downtime. Disaster Recovery design accepts that IT failures will bring down business processes.