By Siew Sim
Last year’s hurricane season started a national discussion about how prepared the nation is to cope with a major disaster. Business on the Gulf Coast was devastated, and in boardrooms across America senior executives have tasked IT professionals with creating and implementing solutions that ensure mission-critical data is continuously protected. IT professionals across all industries have come to realize that, even with thorough planning, restoring data and bringing systems back online quickly with zero data loss can be an overwhelming task.
The complexity and cost of solving data protection and recovery issues today are rooted in the fact that it takes multiple tools to deliver a solution that still doesn’t meet the new requirements of today’s data center. This leaves IT professionals spending countless hours integrating disparate tools and manually recovering data in an attempt to build a real-time infrastructure to support their enterprise. Because there is a variety of protection and recovery tools to choose from, it is crucial to arrive at core metrics that enable IT management to choose the best recovery management solution for their environment.
Recovery management is defined as the act, manner, or practice of managing a return to normal conditions. In the IT industry the definition is more specific: it describes how organizations return systems, applications, and data to “normal” conditions. When unexpected failures occur, the goal is to bring IT systems back to their most recent consistent state and to restore business operations within minutes, reducing downtime and preventing significant financial loss.
The Evaluation Metrics
To evaluate a recovery management solution, one must have properly defined metrics. Data recovery service level agreements (SLAs) are traditionally measured by recovery time objectives (RTO) and recovery point objectives (RPO). RTO defines the time required to recover a set unit of missing data, and RPO defines the potential data loss: the time gap between the most recent application-consistent recovery point and the physical failure point. RTO and RPO are good objectives for setting SLAs with regard to data recovery, but they are not sufficient for measuring a recovery management solution. For example, a snapshot tool may recover a server’s data in minutes, but it cannot recover a granular object. Locating a lost object in snapshots is a manual process, and the RTO could be many hours. In this case, the RTO has nothing to do with the tool itself; it depends entirely on the manual process. Likewise, a data replication tool can deliver zero or near-zero RPO when a server fails, but it cannot recover business data if the data is corrupted and the corruption is replicated.
As a result of examples like these, IT requires more comprehensive metrics to properly evaluate a recovery management solution. There are ten core metrics that fall into three categories: Recovery Time Characteristics, Recovered Data Characteristics, and Recovery Scalability Characteristics. The following chart explores these metrics in detail.
Recovery Time Characteristics
Recovery Time Objective (RTO). RTO defines how fast the solution is capable of recovering the data and application it is designed to protect. The RTO of most recovery solutions depends on whether a data verification process is needed during the recovery and on the size of the data set to be recovered. A solution that provides instant recovery regardless of data set size greatly reduces or eliminates business downtime.
Recovery Time Granularity (RTG). RTG determines the time spacing for selecting a recovery point; this is an important parameter for recovering from logical failures. Unlike RPO, which determines the last recovery point prior to a physical failure, RTG defines recovery point selection options prior to the most recent recovery point.
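To make the difference between RPO and RTG concrete, here is a minimal Python sketch. The four-hour schedule, the timestamps, and the function names are hypothetical and chosen only for illustration. The sketch also assumes that RTG can be read as the largest spacing between consecutive recovery points, which is one reasonable interpretation of the definition above rather than a standard formula.

```python
from datetime import datetime, timedelta


def rpo_gap(recovery_points, failure_time):
    """Gap between the latest recovery point at or before a physical failure and the failure."""
    usable = [p for p in recovery_points if p <= failure_time]
    if not usable:
        raise ValueError("no recovery point precedes the failure")
    return failure_time - max(usable)


def recovery_time_granularity(recovery_points):
    """Largest spacing between consecutive recovery points (assumed reading of RTG)."""
    pts = sorted(recovery_points)
    if len(pts) < 2:
        raise ValueError("need at least two recovery points")
    return max(later - earlier for earlier, later in zip(pts, pts[1:]))


def last_point_before(recovery_points, corruption_time):
    """Latest recovery point strictly before a logical failure such as data corruption."""
    usable = [p for p in recovery_points if p < corruption_time]
    return max(usable) if usable else None


# Hypothetical schedule: one recovery point every four hours, 00:00 through 20:00.
start = datetime(2006, 1, 1, 0, 0)
points = [start + timedelta(hours=4 * i) for i in range(6)]

failure = datetime(2006, 1, 1, 21, 30)     # physical failure (hypothetical)
corruption = datetime(2006, 1, 1, 9, 15)   # logical failure (hypothetical)

gap = rpo_gap(points, failure)                          # data at risk after a physical failure
rtg = recovery_time_granularity(points)                 # spacing of selectable recovery points
restore_point = last_point_before(points, corruption)   # where to roll back to
```

With this schedule, a physical failure at 21:30 loses the 90 minutes since the 20:00 recovery point, while recovering from the 09:15 corruption means rolling back to 08:00. The finer the spacing between recovery points, the closer a rollback can land to the moment just before the logical failure.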
Recovered Data Characteristics
Recovery Point Objective (RPO). RPO defines the minimum time gap between a failure and the most recent point in time to which data can be recovered. The smaller the gap, the less data is lost.
Recovery Object Granularity (ROG). ROG measures the level of objects that a solution is capable of recovering. For instance, object granularity may be a storage volume, a file system, a database table, a transaction, a mailbox, an email message, etc.
Recovery Event Granularity (REG). REG measures the capability of a recovery management solution to track events and to recover a failed application or missing data to a specific event.
Recovery Consistency Characteristics (RCC). RCC defines the usability of recovered data by the associated application. RCC of a recovery management solution depends not only on how data is captured and stored, but also on the data type being protected.
Recovery Scalability Characteristics
Recovery Location Scope (RLS). RLS defines where the protected data must be stored when recovery takes place. Most data protection solutions are designed such that the protected data is stored locally. Robust recovery management solutions can protect and recover data over a LAN or WAN.
Recovery Service Scalability (RSS). RSS is measured by service (number of applications or data sets the solution is capable of protecting) and capacity (the maximum size of the data it can store).
Recovery Service Resiliency (RSR). RSR defines how well a recovery management solution tolerates failures. This includes system and data failures as well as data security authorization. For instance, if a system component fails, can the solution continue such that an application would be continuously protected? And can it also self-recover from any internal failures?
Recovery Management Cost (RMC). RMC defines the cost efficiency of a recovery management solution. Data services such as backup, snapshots, replication, policy management, and others are traditionally separate tools with very different architectures. For better RMC, find a consolidated recovery management platform that simplifies IT administration by reducing the number of tools necessary to manage data. For further efficiency, use a solution that reduces the storage and network resources necessary to protect and recover data.
Recovery Management Scorecard
As discussed earlier, there is a myriad of protection and recovery tools to choose from, so it makes sense to define the core metrics that enable IT management to evaluate which solutions best fit their environment. Now that we have an understanding of the “Top Ten” metrics necessary to evaluate a recovery management solution, let’s apply these metrics to solutions that exist in the market today. Practical application of the metrics not only solidifies understanding of the metrics themselves, but also gives a better comprehension of available solutions and how they compare.
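One way to lay out such a scorecard is sketched below in Python. The grouping of the ten metrics follows the categories defined above. Everything else is an assumption made for illustration: the `Scorecard` class, the 1-to-5 rating scale, and the sample ratings are hypothetical and do not describe any real product.

```python
from dataclasses import dataclass, field

# The ten metrics, grouped into the three categories defined in the article.
METRICS = {
    "Recovery Time Characteristics": ["RTO", "RTG"],
    "Recovered Data Characteristics": ["RPO", "ROG", "REG", "RCC"],
    "Recovery Scalability Characteristics": ["RLS", "RSS", "RSR", "RMC"],
}
ALL_METRICS = [m for group in METRICS.values() for m in group]


@dataclass
class Scorecard:
    """Ratings for one candidate solution (hypothetical 1-5 scale, 5 is best)."""
    solution: str
    ratings: dict = field(default_factory=dict)  # metric abbreviation -> rating

    def rate(self, metric, rating):
        if metric not in ALL_METRICS:
            raise KeyError(f"unknown metric: {metric}")
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self.ratings[metric] = rating

    def missing(self):
        """Metrics not yet rated, so an evaluation cannot silently skip one."""
        return [m for m in ALL_METRICS if m not in self.ratings]

    def category_averages(self):
        """Average rating per category, over the metrics rated so far."""
        averages = {}
        for category, metrics in METRICS.items():
            rated = [self.ratings[m] for m in metrics if m in self.ratings]
            if rated:
                averages[category] = sum(rated) / len(rated)
        return averages


# Placeholder ratings for an unnamed candidate; not an assessment of any product.
card = Scorecard("Candidate A")
card.rate("RTO", 4)
card.rate("RTG", 2)
```

Here `card.missing()` lists the eight metrics still to be rated, which keeps a fast RTO from hiding a weakness elsewhere, such as coarse recovery object granularity.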
Summary
In most industries today, the service level agreements for data protection and recovery have moved to a point where there is no time for backup windows, no tolerance for data loss, and very little margin for recovery downtime. Add to that the increased business demands for disaster recovery of mission- and business-critical data, along with new compliance requirements, and you can quickly determine that the legacy tools of data protection and recovery are ill-equipped to handle today’s requirements. The ten metrics of recovery management above enable IT management to weigh their own internal business requirements against the products they are evaluating.
Siew Sim is the chief technology officer of Asempra Technologies in Sunnyvale, CA.
www.asempra.com