RTO and RPO
It can be a struggle to determine and quantify an acceptable level of risk for your organization, and disaster recovery is no exception. RTO, which stands for Recovery Time Objective, and RPO, which stands for Recovery Point Objective, help reduce the ambiguity at an operational level, giving you a pragmatic framework you can use to plan and execute.
However, it’s common to focus on the “hows” of recovery: the technologies you’ll use, their features, and the extent to which they can recover entire systems in one click. As a result, you spend less time discussing the organizational ramifications—how much data to retain and how quickly it can be restored. By focusing on RTO and RPO, you can ensure that these important aspects don’t go overlooked.
In this guide, we’ll break down RTO and RPO, including:
- What is the meaning of RTO and RPO?
- What is the difference between them?
- How do RTO and RPO fit into your disaster recovery plan?
- How do you calculate your RTO and RPO?
- What is the impact of RTO and RPO?
- How can you improve your RTO and RPO?
How Do You Define Recovery Time and Recovery Point Objective?
RTO and RPO can help you define your business requirements, which will differ between applications, and measure how well your data protection solutions can satisfy them. In addition to RTO and RPO, there are some other helpful terms to know. We’ll break each one down below.
The meaning of RPO focuses on how much data your business can afford to lose, measured in time (e.g., one hour’s worth of data). If a production system is impaired by data loss or data corruption, you can recover by reverting to a backup. So RPO defines how far back you are willing to go, accepting the loss of all data written after that recovery point. As you can see, it’s an important part of your disaster recovery plan.
This, in turn, determines the granularity, or frequency, of your point-in-time copies. If your tolerance for data loss is low, you will need to back up more frequently and will often have to dedicate more storage to house these backups.
RPO comes directly from business requirements. As different applications within your organization carry different business value, RPO is fundamentally an application-specific attribute.
In determining RPO, you should consider the risk of faulty backups. If one point-in-time copy turns out to be unusable, you must fall back to the one before it, which doubles the effective gap between recovery points. By regularly testing your backups, you can ensure that they are recoverable when needed. Even misconfigurations and lapsed licensing can wreak havoc on efforts to return to full production.
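To make the relationship between backup frequency, faulty backups, and RPO concrete, here is a minimal Python sketch. The function names and the simple model (evenly spaced backups, failure just before the next scheduled backup) are assumptions made for illustration, not part of any particular product.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         consecutive_faulty: int = 0) -> timedelta:
    """Worst-case data loss when the newest backups are unusable.

    Assumes evenly spaced backups and a failure that strikes just before
    the next scheduled backup. Every unusable backup pushes the recovery
    point back by one more interval.
    """
    if consecutive_faulty < 0:
        raise ValueError("consecutive_faulty must be non-negative")
    return backup_interval * (consecutive_faulty + 1)


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              consecutive_faulty: int = 0) -> bool:
    """True if the worst-case loss stays within the RPO."""
    return worst_case_data_loss(backup_interval, consecutive_faulty) <= rpo


hourly = timedelta(hours=1)
print(worst_case_data_loss(hourly))                         # 1:00:00
print(worst_case_data_loss(hourly, consecutive_faulty=1))   # 2:00:00
print(meets_rpo(hourly, rpo=timedelta(hours=1), consecutive_faulty=1))  # False
```

In this model, an hourly schedule only satisfies a one-hour RPO if every backup is recoverable, which is one reason regular restore testing matters.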
So what is the meaning of RTO? RTO is about how much time your organization can afford to lose after a disaster strikes until you’re back in business. Generally, this covers the entire time it takes to become operational again. Depending on your disaster and protection scenario, there are multiple factors to consider, many of which are often overlooked.
- Disaster declaration: Who is authorized to declare a disaster and commence recovery? What are the measures they must take before the red button is pushed?
- System setup: If the production site is damaged, how much time does it take to set up an operational system at a secondary site?
- Recovery execution: How long will it take to get the right people to execute recovery?
- Backup access: How much time would it take to gain access to the backup data? Is it online or does it require physical travel? If it is stored on a remote site, how do you connect if your primary site is down?
- Transfer: How long does it take to transfer the data? If data is stored in the same site, transferring a 100GB dataset over a modern 10GbE network takes about 1.5 minutes, and nearly 15 minutes over a 1GbE network.
- System restart: Take into account the time it takes to restart servers, launch applications, and load the data into production.
It’s important for you to analyze how your recovery process is impacted by various activities so you can establish a realistic RTO. If you don’t plan these processes correctly, you’ll end up organizing and defining an action plan in real time. And that means your actual recovery time won’t meet the designated objective.
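The factors above can be added up into a rough recovery-time estimate. The following Python sketch is illustrative only: the 90% link efficiency and every phase duration except the transfer are hypothetical placeholders that you would replace with your own measurements.

```python
def transfer_minutes(dataset_gb: float, link_gbps: float,
                     efficiency: float = 0.9) -> float:
    """Estimated minutes to copy dataset_gb gigabytes over a link_gbps link.

    efficiency is an assumed fraction of the nominal line rate that is
    actually usable after protocol overhead.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    gigabits = dataset_gb * 8
    return gigabits / (link_gbps * efficiency) / 60


def estimated_recovery_minutes(phases: dict) -> float:
    """Sum the duration of every recovery phase, in minutes."""
    return sum(phases.values())


# Hypothetical durations in minutes; only the transfer is computed.
phases = {
    "disaster declaration": 30,
    "system setup": 60,
    "recovery execution": 20,
    "backup access": 15,
    "transfer": transfer_minutes(100, link_gbps=1),
    "system restart": 20,
}

rto_minutes = 180
total = estimated_recovery_minutes(phases)
print(f"100 GB over 10GbE: {transfer_minutes(100, 10):.1f} min")
print(f"100 GB over 1GbE:  {transfer_minutes(100, 1):.1f} min")
print(f"estimated recovery: {total:.0f} min against an RTO of {rto_minutes} min")
print("within RTO" if total <= rto_minutes else "exceeds RTO")
```

Notice that in this made-up example the data transfer is a small share of the total. The human and logistical steps dominate, which is why they deserve as much planning as the technology.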
What is the Difference Between RPO and RTO?
Before we dive any deeper, it’s important to spell out the differences between these two terms. RPO focuses on data loss, while RTO focuses on application downtime and how long it takes to become fully operational after an outage. While they’re related and measured similarly, they focus on two important, but different, aspects of disaster recovery.
Now, let’s take a look at some other terms that come into play for both RPO and RTO.
Technical RTO (TRTO)
You can zoom in to correctly identify your technical RTO. This refers to the amount of time consumed within the boundaries of your data protection solution. Steps in this phase may include:
- Spinning up a new set of VMs hosting the application.
- Configuring the VMs correctly and establishing communication.
- Transferring the data from the backup medium to the production storage system.
- Launching the applications and loading the recovered data.
Relaxing TRTO should translate into cost savings, for example by auto-tiering backup storage from SSDs to spinning disks.
When referring to RTO, keep in mind the difference between the overall recovery process and the technical recovery phase. Have a crisp definition of your TRTO and make it clear which RTO you are referring to.
Retention period is the duration your business requires data copies to be stored until they may (or must) be discarded. Like RPO and RTO, retention periods are application-specific.
In addition, the business value of data often decreases with time: the older the data gets, the less valuable it becomes. Requirements may therefore be relaxed correspondingly as data ages. The business requirements may be captured using time tiers as defined below.
Service-Level Agreement (SLA) Tiers
Given the changing requirements over time, retention periods, TRTO, RTO, and RPO are specified using time tiers. These may be referred to as SLA tiers (SLA is, unfortunately, an overused term). For example, an organization may require the following tiers for its MySQL application:
|Data age|TRTO|RPO|Description|
|---|---|---|---|
|<24 hours|5 minutes|1 hour|For the most recent 24 hours, your business application will be operational within minutes, utilizing hourly point-in-time backups.|
|1-30 days|1 hour|1 hour|For the rest of the month, you want to be able to complete technical recovery within an hour to the nearest hour.|
|30+ days|24 hours|24 hours|For anything older than a month, you will be able to restore a point-in-time from a particular day from your archiving system within 24 hours.|
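Tiers like these are easy to capture as data so that tooling can look up the objectives for a given recovery point. The Python sketch below is one possible encoding. It reads the second column of the table as TRTO and the third as RPO (an interpretation based on the row descriptions), and the names `SlaTier` and `tier_for` are made up for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SlaTier:
    max_age: Optional[timedelta]  # exclusive upper bound; None = no bound
    trto: timedelta               # technical recovery time objective
    rpo: timedelta                # recovery point objective


# Ordered from newest data to oldest, mirroring the three rows above.
TIERS = [
    SlaTier(timedelta(hours=24), trto=timedelta(minutes=5), rpo=timedelta(hours=1)),
    SlaTier(timedelta(days=30), trto=timedelta(hours=1), rpo=timedelta(hours=1)),
    SlaTier(None, trto=timedelta(hours=24), rpo=timedelta(hours=24)),
]


def tier_for(data_age: timedelta) -> SlaTier:
    """Return the first tier whose age bound covers data_age."""
    for tier in TIERS:
        if tier.max_age is None or data_age < tier.max_age:
            return tier
    raise ValueError("no tier covers this data age")


print(tier_for(timedelta(hours=3)).trto)   # 0:05:00
print(tier_for(timedelta(days=7)).trto)    # 1:00:00
print(tier_for(timedelta(days=90)).rpo)    # 1 day, 0:00:00
```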
Consider specific use cases to help you define these requirements. The first tier in the above example addresses a common data damage scenario such as when a VM is accidentally deleted, a file or folder has been overwritten, or a database has been corrupted. Unlike in a disaster scenario, all systems remain operational.
Your business can sustain some data loss but requires minimal downtime. For this scenario to be practical, there must be no operational overhead. This means that end users (tenants in a cloud environment) must be able to execute the recovery on their own, without administrative assistance.
Less demanding SLAs allow for cost reduction through the use of slower, lower-cost storage media or facilities. Your organization must become comfortable with the trade-offs of losing more data or having longer downtime after a disaster strikes.
How Do You Fit RPO and RTO into Your Disaster Recovery Plan?
Disaster recovery is an increasingly important component of modern business operations. When an event strikes that causes a business to lose connectivity, every minute of downtime can represent thousands in lost revenue and slowly erode customer confidence in your business.
Years ago, disaster recovery plans centered around on-premises backups. Today, the increased use of the cloud means that downtime can grind most business activity to a halt. A good disaster recovery plan not only dictates the steps an organization should take in the event of a disaster; the process of developing the plan also reveals your vulnerabilities and needs.
Driving Components of Disaster Recovery Planning
A recovery point objective (RPO) and recovery time objective (RTO) are crucial elements in any disaster recovery plan. RPO and RTO not only help in identifying risks and needs but also dictate the type of backup infrastructure that your enterprise will put together.
RPO seeks to answer a relatively simple question: “How much data can I afford to lose?” It’s a measure of tolerance that helps an enterprise gauge the impact of downtime. It draws a line in the sand: the longest stretch of recent data, measured in time, that can be lost without exceeding the threshold set in the plan. In other words, how far back must an organization recover after a disruption to resume business as usual?
RTO is a target time that an organization sets for recovery after an incident. This component is a calculation of how fast a business has to mobilize to achieve an acceptably small gap in business continuity.
For example, if an organization determines that it needs to rebound from an outage within three hours, then it has a three-hour RTO. If your volume of critical data is low, then your organization might be able to withstand longer delays. Some organizations have very little tolerance for downtime and need a solution that provides as close to 100% uptime as possible.
How Do RPO and RTO Aid in Data Recovery?
RPO and RTO are important components of your disaster recovery plan because they provide your organization with benchmarks around acceptable amounts of data loss and downtime.
In a perfect world, the second your application goes down, you’d be able to immediately recover all of your data and be operational again. However, that’s not reality. So you need to decide what those acceptable levels are without impacting your business objectives. They will differ based on your applications, their importance to your business, and also on your own company resources.
That’s what RPO and RTO can help you do. RPO will determine the amount of acceptable data loss, while RTO determines the amount of acceptable downtime. Both goals, measured in terms of time, can help your organization determine backup frequency, type, and tooling.
How Do You Calculate Your RTO and RPO?
It’s easy to see that both the RPO and RTO cannot be chosen lightly. The numbers have to reflect how your organization functions, from all aspects. However, some general guidelines can help a business determine these timeframes.
Factors to consider in determining RPO include the frequency of crucial data changes in your business, your backup schedule, and resources for the backup. An organization can also consider the projected costs of downtime. Costs can be complicated, as unscheduled interruption in a business ripples throughout the organization, affecting everything from sales figures to salaries.
These factors also play a role in determining the appropriate RTO for your business. How much can your business absorb before it has a noticeable impact?
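As a starting point for that conversation, you can turn a downtime cost estimate and a data change rate into rough upper bounds for RTO and RPO. The Python sketch below assumes a deliberately simple linear model (cost and data changes accumulate at a constant rate), and all figures are hypothetical. Real costs are rarely this tidy, so treat the result as a first approximation.

```python
def max_rto_hours(tolerable_loss: float, downtime_cost_per_hour: float) -> float:
    """Longest downtime whose cost stays within tolerable_loss.

    Assumes downtime cost grows linearly with time.
    """
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime_cost_per_hour must be positive")
    return tolerable_loss / downtime_cost_per_hour


def max_rpo_hours(tolerable_lost_changes: float, changes_per_hour: float) -> float:
    """Longest backup gap that loses at most tolerable_lost_changes.

    Assumes critical data changes arrive at a constant rate.
    """
    if changes_per_hour <= 0:
        raise ValueError("changes_per_hour must be positive")
    return tolerable_lost_changes / changes_per_hour


# Hypothetical inputs: $10,000 per hour of downtime, $30,000 tolerable loss,
# 2,000 critical records changed per hour, 500 records acceptable to re-enter.
print(max_rto_hours(30_000, 10_000))   # 3.0
print(max_rpo_hours(500, 2_000))       # 0.25
```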
Integrating RPO and RTO into your business impact analysis
So, what does all of this mean for most businesses? The concepts of RPO and RTO feed into your business impact analysis, a methodology used to identify and measure the impact of a loss of services such as network connectivity. The business impact analysis is a time-intensive process that takes into account all activities within the organization, critical or not.
Determining this impact requires not only in-depth knowledge of a business’s operations, but experience in the differing capabilities of backup and disaster recovery systems, based on expense and feature. One way to achieve minimal downtime with maximum efficiency and protection is implementing a cloud-native recovery solution that restores to a point-in-time. By retaining valuable metadata and restoring workloads to their last-best-known state, your organization can significantly reduce RTO in the inevitable event of an outage.
What is the Impact of RTO and RPO?
One of the (many) reasons Trilio supports tenant-driven recovery workflows is that it allows organizations to trim their RTO while still mandating SLA-driven RPOs at a management level. This balance allows administrators to define a protection schedule and policy for each workload, but gives your tenants control to manage and restore point-in-time backups without requiring intervention.
Defining RTO and RPO helps to strike a balance between disaster preparation and cost efficiency, while promising critical data availability that’s needed to run your business. Data loss may occur even when the infrastructure is uninterrupted, and preparedness is yet another tool at our disposal to limit and mitigate the potential negative impacts of unexpected data loss.
How Do You Improve Your Recovery Time Objectives?
While your organization’s RPO is largely dependent on your backup frequency and completeness, there are many varied factors that impact the time it takes to get your environment back to a working state. Let’s take a look at actionable tips for lowering recovery time objectives (RTOs).
Offsite Disaster Recovery
As mentioned, organizations can immediately shrink their RPOs by increasing backup frequency. More backups mean more snapshots of an organization’s critical data, which in turn lowers RPO. Frequent backups enable access to more recent snapshots of your critical data, thus reducing the time needed to execute a recovery.
However, your backup strategy can also improve your RTOs. Keep an off-site secondary copy of live data sets that you can instantly switch to in the event of a disaster. By storing the copy in an off-site server, your RTO is reduced to the time it will take to failover from this server. Since replication frequency determines your RPO, you can lower RPOs on this off-site server by replicating more often.
Using “Changed Block Recovery” Solutions
You can lower your RTOs by using incremental-forever backup solutions with “changed block recovery.” Incremental backups are much smaller than full backups since they only back up the blocks of data that have changed since the last backup or the blocks of data needed to restore a workload to a given point in time. In the event of a disaster, these changed blocks can be automatically reassembled to form a synthetic full backup image.
Solutions like Trilio also capture valuable metadata — including operating system, applications, configurations, and more — saving you the time and headache of piecing together workload snapshots from different sources. This reduces your total backup time and, by extension, your RTOs.
When using such solutions for a physical or virtual backup, further modifications to data blocks are continuously monitored via changed block tracking.
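The idea behind changed-block incrementals and synthetic full images can be shown with a toy model in Python. This is a simplified sketch, not how any specific product works: volumes are modeled as dictionaries of block index to content, changes are found by direct comparison (real changed block tracking is typically done by the hypervisor or storage layer), and deleted blocks are ignored.

```python
from typing import Dict, List

Blocks = Dict[int, bytes]  # block index -> block content


def changed_blocks(previous: Blocks, current: Blocks) -> Blocks:
    """Return only the blocks that are new or differ from the previous state."""
    return {i: data for i, data in current.items() if previous.get(i) != data}


def synthetic_full(full: Blocks, incrementals: List[Blocks]) -> Blocks:
    """Rebuild a full image by replaying incrementals, oldest first."""
    image = dict(full)
    for incremental in incrementals:
        image.update(incremental)
    return image


day0 = {0: b"boot", 1: b"data-v1", 2: b"logs-v1"}
day1 = {0: b"boot", 1: b"data-v2", 2: b"logs-v1"}
day2 = {0: b"boot", 1: b"data-v2", 2: b"logs-v2", 3: b"new-table"}

inc1 = changed_blocks(day0, day1)  # only block 1 changed
inc2 = changed_blocks(day1, day2)  # block 2 changed, block 3 is new

print(sorted(inc1), sorted(inc2))                  # [1] [2, 3]
print(synthetic_full(day0, [inc1, inc2]) == day2)  # True
```

Each incremental holds one or two blocks rather than the whole volume, yet the full day-2 image can still be reassembled from the initial full backup plus the incrementals.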
Location is Key
You can also lower recovery times by ensuring that your recovery media is in the same location or platform as your recovery/failover servers. Although cloud backups come with a lot of benefits, having the only copy of backup data in the cloud and trying to execute a recovery on your on-premises servers will result in several challenges.
Downloading all your data may take days or even weeks. You should either keep a local copy of backup data on-premises (preferably offline) or recover your applications and workloads in the cloud.
To achieve zero or near-zero RTO, organizations should leverage synchronous mirroring. This approach works by synchronously writing I/O from primary storage media to another mirrored system. The first system waits for acknowledgment from the second system before writing the next I/O set. The secondary copy is stored in an active state that enables immediate recovery — in essence, high-availability in a dual-node clustered server.
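The acknowledgment behavior described above can be sketched in a few lines of Python. This is a toy, single-process model with hypothetical class names (`Volume`, `SyncMirror`). A real implementation works at the storage or replication layer and has to deal with networking, ordering, and failure handling that are omitted here.

```python
from typing import Dict


class Volume:
    """In-memory stand-in for a storage system (hypothetical)."""

    def __init__(self, name: str, online: bool = True) -> None:
        self.name = name
        self.online = online
        self.blocks: Dict[int, bytes] = {}

    def write(self, index: int, data: bytes) -> bool:
        """Store a block; the return value plays the role of the acknowledgment."""
        if not self.online:
            return False
        self.blocks[index] = data
        return True


class SyncMirror:
    """Completes a write only after the mirror has acknowledged it."""

    def __init__(self, primary: Volume, secondary: Volume) -> None:
        self.primary = primary
        self.secondary = secondary

    def write(self, index: int, data: bytes) -> None:
        if not self.primary.write(index, data):
            raise IOError("primary write failed")
        # The call blocks here, so the next write cannot start until the
        # secondary has acknowledged this one.
        if not self.secondary.write(index, data):
            raise IOError("mirror did not acknowledge the write")


mirror = SyncMirror(Volume("primary"), Volume("secondary"))
for i, payload in enumerate([b"a", b"b", b"c"]):
    mirror.write(i, payload)

# Both copies are identical, so either one can serve immediately.
print(mirror.primary.blocks == mirror.secondary.blocks)  # True
```

The trade-off is latency: every write pays for the round trip to the mirror, which is part of why near-zero RTO and RPO are expensive.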
Finding the Right Storage Backup
Your RPO and RTO metrics should influence your choice of backup infrastructure and data redundancy strategies. The lower your RPO and RTO, the more complex and expensive your backup infrastructure will be.
If your RTO is zero, it means that your business cannot afford any downtime. Leveraging a redundant IT infrastructure with off-site storage of replicated data or a high availability cluster for seamless failover might be your only choice.
One of the best ways to improve RTO is by leveraging backup solutions (like Trilio) that can execute “recovery in place.” Also known as “boot from backup” or “instant recovery,” backup solutions with such functionality allow you to run data stores and servers directly from the backup.
That’s preferable to waiting for the failover to complete before you can access your systems and data. Although this is a relatively new functionality, it is the fastest way to achieve business continuity in the event of a disaster.
Rewards Exceed the Effort
Reducing recovery time objectives can have tangible benefits for your organization. Although extra measures must be taken in order to achieve this, the rewards can far exceed the effort. Leveraging a robust data protection strategy and best-in-class data backup solution to protect internal and external cloud environments is a must for any organization dependent on availability and resiliency.
To learn more about incremental backups and improving your RTO and RPO with in-place recovery, explore TrilioVault for Kubernetes or chat with one of our K8s experts today.