Business continuity is essential to any modern enterprise. In a world where organizations constantly embrace the microservices paradigm and applications are increasingly deployed on containerized platforms, traditional disaster recovery (DR) solutions often fall short.
As a leading enterprise Kubernetes platform, OpenShift requires a specialized approach to DR that accounts for its unique architecture and the complexities of container orchestration. An effective OpenShift DR strategy goes beyond simple backups, focusing on the swift and reliable recovery of applications, persistent data, and the entire cluster state.
This article explores the key components and different approaches that can be used to implement a comprehensive DR solution for OpenShift.
Summary of key OpenShift disaster recovery strategies
The following table summarizes the key OpenShift DR strategies explained in this article.
| Strategy | Description |
|---|---|
| Backup and restore | The most basic active/passive approach, relying on point-in-time data backups. It has the lowest cost but typically yields the weakest achievable RTO/RPO because recovery requires a full restore from discrete backups. |
| Volume replication | Replicates persistent volumes between sites at the storage layer (synchronously or asynchronously), which reduces RTO and RPO but requires specialized storage capabilities. |
| Application-level replication | The application or middleware handles data synchronization between instances, enabling stringent RPO/RTO goals while ensuring data consistency. |
| Distributed stateful workload | An active/active strategy running across multiple sites, a design required for zero RPO and near-zero RTO but sensitive to network latency. |
Understanding OpenShift disaster recovery
The risks to OpenShift, as with any other enterprise technology environment, include a range of potential disruptions such as hardware failures, software problems, power or network outages, physical disasters, and, of course, cyber attacks like ransomware. An enterprise using traditional methods could address these risks by replicating an entire virtual machine or creating an immutable snapshot and be reasonably confident that it had captured everything needed to run an application. With OpenShift, there are more dependencies, and the flexible, distributed nature of containers can make applications vulnerable to single points of failure, since a distributed architecture can magnify the impact of a hardware outage.
The following figure highlights the major differences between traditional DR and OpenShift DR.

Comparing OpenShift DR with traditional DR
Conventional disaster recovery typically involves backing up and restoring entire virtual machines or physical servers along with their file systems. In contrast, disaster recovery for OpenShift takes a different approach, centering on the application itself and its unique platform requirements. The main challenge goes beyond simply ensuring data integrity; it’s about safeguarding the entire complex state of a containerized application, including all of its configurations and manifest files. This requires a more granular, specialized approach, such as using API-driven solutions to capture and restore the complete application stack, from persistent volumes to custom resource definitions, instead of relying on a single, monolithic infrastructure image.
Designing disaster recovery architectures for OpenShift
Before we discuss the different disaster recovery approaches, it’s essential to understand two key performance targets that determine the effectiveness of a recovery strategy:
- Recovery time objective: The RTO defines the maximum acceptable downtime, which is the target time set to have the applications and their underlying systems fully back online and operational following a disaster.
- Recovery point objective: The RPO specifies the maximum amount of data loss that is tolerable, representing the time interval between the last successful data synchronization or backup and the moment the failure occurred.

RTO and RPO
For the upcoming discussion on different disaster recovery architectures and design patterns, we will operate under the fundamental assumption that the workload being protected is stateful. This means the presented approach must not only handle application configurations but also securely and efficiently replicate persistent data to ensure both transactional integrity and a successful application restart on the secondary cluster.
To understand the options, let’s examine a common scenario: a basic application with a front-end and its underlying database. This example will serve as the foundation to explore four varying strategies for OpenShift DR implementation, ranging from simple recovery to highly resilient patterns:
- Backup and restore
- Volume replication
- Application-level replication
- Distributed stateful workload
Backup and restore
This strategy is the most traditional and simplest disaster recovery approach. It is an active/passive model, where the production environment is fully active and the secondary recovery site is a minimally provisioned, “cold” standby. This standby state is the primary reason the method is inexpensive, as it significantly reduces licensing and resource costs. However, because the secondary site is not running, recovery involves a full restore process, resulting in longer RTOs.
This approach offers the poorest overall performance in terms of both RPO and RTO due to its reliance on manual restoration and discrete backup intervals, making it unsuitable for applications with stringent business continuity requirements. Furthermore, as the application and cluster grow, managing backup solutions and guaranteeing backup consistency become increasingly complex.

The backup and restore approach
Despite its limitations for achieving near-zero RTO/RPO, building a robust backup and restore capability is still essential for other critical data protection use cases outside of site-level disaster recovery, such as recovering from logical data errors or malicious data-level attacks. Therefore, while it is unlikely to be the primary DR strategy, it remains a fundamental capability that a platform should offer.
To support the backup and restore pattern, the underlying platform requires two key technical capabilities:
- Global traffic management: The ability to configure a global load balancer (GLB) that directs traffic to the primary site and can be manually or automatically reconfigured to switch traffic to the secondary site upon failure.
- Self-service storage backup: A robust storage infrastructure capable of supporting backup and restore operations. The platform should enable users to self-serve and configure their own backup schedules for persistent data.
Backup and restore functionality is not built into OpenShift out of the box. However, many storage vendors have developed backup and restore solutions that administrators can use, offering features such as snapshots via CSI drivers.
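As an illustration of the self-service scheduling capability described above, here is a minimal sketch of a scheduled, snapshot-based backup, assuming the OADP operator (which deploys Velero) is installed in the `openshift-adp` namespace. The schedule name, application namespace, and retention period are hypothetical, and a storage vendor's own backup operator would use its own equivalent resource:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: my-app-nightly          # hypothetical schedule name
  namespace: openshift-adp      # assumes OADP/Velero is installed here
spec:
  schedule: "0 2 * * *"         # run a backup every night at 02:00
  template:
    includedNamespaces:
      - my-app                  # hypothetical application namespace
    snapshotVolumes: true       # snapshot persistent volumes via the CSI/storage layer
    ttl: 720h                   # retain each backup for roughly 30 days
```

In this model, users can create their own `Schedule` objects per namespace, which satisfies the self-service requirement without involving the platform team for every application.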
Volume replication
The volume replication approach is a fairly common one and is classified as an active/passive disaster recovery strategy. In this model, the primary cluster is active (running production apps), while the secondary cluster is passive. The strategy focuses on ensuring continuous data availability by having the underlying storage system continuously replicate the primary cluster’s persistent volumes to the secondary site.
The secondary site remains passive because the applications on that cluster are not running. Upon failure, administrators must manually activate the applications to consume the replicated data, thus completing the switch to the secondary site.
The critical distinction in this approach lies in the replication method, which directly impacts the recovery point objective (RPO):
- Synchronous replication: Data is written simultaneously to both sites, ensuring consistency across all replicas. While this technique is designed to guarantee a zero RPO (no data loss) upon recovery, it introduces inevitable application latency because the system must wait for remote acknowledgment from the secondary site before confirming the write operation as complete.
- Asynchronous replication: Data is written to the primary site immediately and then transmitted to the secondary site shortly afterward. This method ensures minimal application latency but inherently results in an RPO greater than zero, meaning a small amount of data loss is possible during a failure. The resulting RPO value is determined by factors like the volume change rate, the network latency between sites, and the size of the replication queue.

The volume replication approach
The RTO for volume replication is the time required to perform failover, which involves switching the global load balancer (GLB) traffic to the secondary cluster and restarting the application services. A key consideration here is that since the data is replicated at the storage layer (crash consistency), the database or stateful application on the secondary site may take longer to start up because it must perform a journal replay or consistency check before becoming fully operational.
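To make the storage-layer replication concrete, the sketch below shows how replication might be declared per volume, assuming a CSI driver that implements the CSI-Addons VolumeReplication API (the mechanism ODF uses for Ceph RBD mirroring). The names are hypothetical and the exact fields vary by driver and version, so treat this as indicative only:

```yaml
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: database-pvc-replication            # hypothetical name
  namespace: my-app
spec:
  volumeReplicationClass: vendor-replication-class  # assumes a class supplied by the storage vendor
  replicationState: primary                 # marks this site's copy as the replication source
  dataSource:
    kind: PersistentVolumeClaim
    name: database-pvc                      # the PVC whose data is mirrored to the secondary site
```

During failover, the corresponding object on the secondary cluster would be promoted by setting `replicationState: primary`, after which the application can be started against the replicated data.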
This strategy demands specific technical capabilities, particularly within an OpenShift environment:
- Underlying storage capability: You need a storage system and/or a Container Storage Interface (CSI) driver capable of performing cross-cluster volume replication. This capability is highly sensitive to network latency, especially for synchronous replication.
- Persistent volume rebinding: Beyond simple replication, a key technical challenge is the portability of persistent volumes. The CSI driver must enable the OpenShift/Kubernetes control plane to “adopt” the replicated volume by ensuring that the new PVC on the secondary cluster correctly binds to the specific existing replicated PV rather than dynamically provisioning a fresh, empty one. This manual or automated “rebinding” is essential for data access (a sketch follows at the end of this section).
- Global traffic management: As with backup and restore, a GLB is mandatory to provide the mechanism for switching application traffic from the failed primary cluster to the secondary cluster.
While it is technically possible to configure volume replication outside the OpenShift abstraction layer, this approach creates a design conflict. The static nature of externally managed replication often clashes with dynamic volume provisioning, necessitating careful design considerations.
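The rebinding requirement described above can be illustrated with plain Kubernetes primitives: on the secondary cluster, a pre-created PV object points at the already-replicated volume, and the application’s PVC pins itself to that PV via `volumeName` instead of triggering dynamic provisioning. This is a minimal sketch; the CSI driver name, volume handle, and storage class are hypothetical placeholders for whatever the storage vendor exposes:

```yaml
# PV representing the already-replicated volume on the secondary site
apiVersion: v1
kind: PersistentVolume
metadata:
  name: replicated-db-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain   # keep the data even if the claim is deleted
  storageClassName: replicated-block      # hypothetical storage class
  csi:
    driver: csi.example-vendor.com        # hypothetical CSI driver
    volumeHandle: vol-replicated-0001     # backend ID of the replicated volume
---
# PVC that adopts the existing PV instead of provisioning a new, empty one
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc
  namespace: my-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: replicated-block
  volumeName: replicated-db-pv            # bind to the specific replicated PV
  resources:
    requests:
      storage: 100Gi
```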
Application-level replication
Application-level replication shifts the responsibility for data synchronization from the storage layer to the application or middleware itself. This approach typically uses an active/passive configuration where a primary instance runs on the active site and a secondary instance runs on the passive site.

Application-level replication
Because replication is application-driven, the system propagates transactions or write requests directly from the primary to the secondary instance. This provides a strong guarantee of application-level consistency, ensuring that the storage at the secondary site is always in a logically valid and consistent state.
The replication can be of the following two types:
- Synchronous: Transactions are committed only after acknowledgment from both the primary and secondary sites, which achieves a zero RPO.
- Asynchronous: Transactions are committed locally and then propagated to the secondary. While this offers low application latency, the RPO is greater than zero because data loss can occur during the transmission gap.
The RTO achievable is similar to that of volume replication, defined by the time necessary to switch the GLB and restart services on the secondary site. However, because the data is already application-consistent, the secondary instance can typically start up faster than with crash-consistent volume replication.
A significant advantage of this approach is that due to optimizations possible only at the application layer, it is generally less sensitive to inter-site network latency compared to synchronous volume replication. While its RPO and RTO performance is often similar to that of volume replication, the key difference lies in where the control resides, shifting the ownership of the data replication procedure more squarely to the application team that manages the stateful middleware.
Using application-level replication does not require any advanced capabilities from the underlying storage infrastructure. The technical requirements for application-level replication are listed below:
- Global traffic management: A global load balancer remains mandatory for redirecting traffic during a failover event.
- East-west communication: The most critical requirement is an east-west communication path that allows pods hosting stateful workloads in the primary failure domain to communicate directly with their counterparts in the secondary domain. In OpenShift, where pods run within a non-externally routable software-defined network (SDN), this can be challenging. It necessitates either a Container Network Interface (CNI) product with native cross-cluster federation capabilities or the use of network tunneling solutions to bridge the two cluster networks.
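As one example of providing that east-west path, the sketch below assumes Submariner (or another implementation of the multi-cluster Services API) connects the two clusters. Exporting the primary database service makes it resolvable from the secondary cluster under a cluster-set DNS name; the service and namespace names are hypothetical:

```yaml
# Created on the primary cluster; the standby instance on the secondary cluster
# can then reach the service at db-primary.my-app.svc.clusterset.local
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: db-primary   # must match the name of an existing Service in this namespace
  namespace: my-app
```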
Distributed stateful workload
The distributed stateful workload approach represents the highest level of resilience and is the most advanced pattern for implementing disaster recovery. Unlike the active/passive models (such as application-level replication), where the secondary site remains idle, this is an active/active approach.
In this active/active model, the middleware is entirely responsible for replicating transactions. Any workload instance in either site is capable of accepting write requests and autonomously coordinating across both sites to ensure that transactions are replicated to a sufficient number of peers before confirming a commit.

The distributed stateful workload
This architecture delivers superior recovery performance:
- RPO: Zero. Since transactions are validated across failure domains before commitment, there is virtually no risk of data loss in the event of a site failure.
- RTO: Minimal / near zero. No human intervention is required for failover. The system automatically detects the loss of instances (typically via a heartbeat timeout) and reorganizes its quorum membership, allowing the surviving instances to take over traffic seamlessly.
The main drawback is that this high resilience comes at the cost of reduced performance. The need to coordinate writes across domains makes the approach sensitive to latency: the latency for any single write transaction can be expected to be at least twice the round-trip network latency between the two failure domains, so, for example, a 30 ms round trip between sites adds at least 60 ms to every write. Over large geographic distances, this penalty can impose severe limitations, making this architecture unsuitable for a majority of common use cases.
Like application-level replication, the disaster recovery procedure is handled entirely by the stateful workload itself, meaning no human intervention is needed during the disaster event. This further emphasizes that ownership of the DR process resides primarily with the application development team, which manages the middleware.
The technical requirements are focused entirely on networking:
- Global traffic management: While a global load balancer is still required, its primary role shifts from facilitating a failover switch to simply distributing traffic across the healthy, active instances in both domains.
- Cross-cluster communication: A reliable east-west communication path between the pods across the two physically separated OpenShift clusters is critical. This complex requirement necessitates advanced networking solutions, such as a capable CNI product with cross-cluster federation or a robust network tunneling mechanism to ensure that the distributed workload instances can reliably form and maintain their consensus quorum.
Summary of OpenShift DR architectural strategies
The following table summarizes the DR approaches discussed above:
| Feature | Backup and restore | Volume replication | Application-level replication | Distributed stateful workload |
|---|---|---|---|---|
| Architectural style | Active/passive (cold standby) | Active/passive (warm standby) | Active/passive (warm standby) | Active/active (stretched cluster) |
| Data consistency | Point-in-time (logical) | Crash-consistent | Application-consistent (logical) | Application-consistent (quorum/ consensus) |
| RPO | Poor (interval between backups) | Moderate (asynchronous: minutes to hours; synchronous: near zero) | Good/excellent (asynchronous: minutes; synchronous: near zero) | Excellent (near zero) |
| RTO | Poor (time to restore data and restart app) | Moderate (minutes): time to switch GLB and restart app and crash recovery (journal replay) | Good (minutes): time to switch GLB and restart app (faster due to logical consistency) | Excellent (near zero): heartbeat timeout and automatic system reorganization |
| DR process trigger | Manual (human required) | Manual or automated | Manual or automated (application-assisted failover) | Automatic (application/middleware handles self-healing) |
| Process ownership | Platform/ infrastructure team | Storage/ infrastructure team | Application/ development team | Application/ development team |
| Required capabilities | Global traffic management (GLB); self-service storage backup | Cross-cluster volume replication (storage/CSI); persistent volume rebinding; GLB | GLB; east-west (cross-cluster) communication between stateful pods | GLB for traffic distribution; reliable east-west communication to maintain quorum |
| Network sensitivity | Low (only for backup transfer) | High (especially synchronous) | Moderate | High (impacts write latency) |
| Cost profile | Lowest (cold standby) | Moderate (warm standby, specialized storage) | Moderate (warm standby, application licenses) | Highest (active/active, high resource consumption) |
GitOps and automation for DR
The GitOps methodology introduces a declarative and automated approach to managing infrastructure and application configurations. By using a version control system like Git as the single source of truth for all OpenShift manifests, you can drastically simplify and accelerate cluster recovery processes.
Cluster recovery via manifest consistency
In a disaster scenario, the primary OpenShift cluster’s control plane may be lost entirely. GitOps ensures that the secondary (or target) cluster can be reliably and quickly restored to the same configuration as the primary by following these steps:
- Declarative configuration: All stateless application manifests (deployments, services, and routes), infrastructure configurations (network policies, role bindings), and even the state of the GitOps tool itself (application objects) are stored as YAML files in Git.
- Simplified cluster recovery: To recover, a new OpenShift cluster is provisioned, and the GitOps agent (e.g., ArgoCD or Flux) is installed. The agent is then pointed to the centralized Git repository.
- Automated reconciliation: The GitOps agent reads the last committed state from Git and automatically reconciles the target cluster to match that desired state. This eliminates manual configuration steps, reduces the chance of human error, and ensures that the application environment is built consistently across failure domains.
This automation significantly improves the RTO for the application’s stateless components, allowing services to be available much faster than traditional manual redeployment.

OpenShift disaster recovery using GitOps
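As a sketch of the provisioning and reconciliation steps described above, the following Argo CD `Application` (assuming the OpenShift GitOps operator, which runs Argo CD in the `openshift-gitops` namespace) points the recovery cluster at a Git repository and lets the agent reconcile everything declared there. The repository URL, path, and namespaces are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend
  namespace: openshift-gitops       # assumes the OpenShift GitOps (Argo CD) operator
spec:
  project: default
  source:
    repoURL: https://git.example.com/acme/openshift-manifests.git  # hypothetical repository
    targetRevision: main
    path: apps/frontend             # hypothetical path holding the application manifests
  destination:
    server: https://kubernetes.default.svc   # the local (recovery) cluster
    namespace: my-app
  syncPolicy:
    automated:
      prune: true                   # remove resources no longer declared in Git
      selfHeal: true                # revert manual drift back to the Git-declared state
    syncOptions:
      - CreateNamespace=true        # create the target namespace if it is missing
```

Because the same `Application` definition can itself live in Git, re-pointing a freshly provisioned cluster at the repository is often the only manual step required to rebuild the stateless portion of the environment.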
The state problem
While GitOps excels at handling the stateless components of an application, it faces multiple fundamental limitations when dealing with stateful applications and their persistent data:
- Configuration vs. state: Git only manages configuration files (the desired state of Kubernetes objects). It cannot replicate or restore the actual data held within persistent volumes. The contents of a database, user session data, or object storage files are part of the runtime state, not declarative configuration.
- Persistent volume claims (PVCs): When the GitOps agent recreates a stateful application, it successfully recreates the PVC object. However, unless an underlying data protection mechanism (like volume replication) is in place, the PVC will either dynamically provision a new, empty volume or fail to bind, rendering the recovered application useless without its data.
It’s essential to note that GitOps can only restore Kubernetes objects, so any persistent data an application depends on must be restored separately before stateful applications, such as databases, can be brought back online. GitOps must therefore be paired with a separate data replication solution (like volume replication or application-level replication) to handle the data plane for stateful workloads: GitOps ensures that the application code is running, while the data replication solution ensures that the application data is present.
OpenShift disaster recovery using ODF and RHACM
The OpenShift Data Foundation (ODF) is Red Hat’s integrated software-defined storage solution for OpenShift Container Platform. It simplifies the management of persistent storage for containerized workloads and offers a unified platform for file, block, and object storage directly within the OpenShift cluster.
ODF extends the capabilities of OpenShift by providing a persistent data layer for stateful applications in both on-premises and hybrid cloud environments. It can work with both local and external storage solutions. Unlike traditional storage systems, which may require disparate drivers and operators for different storage types, ODF delivers a consolidated, Kubernetes-native approach to meet all persistent storage needs for the cluster.
ODF disaster recovery options
The DR strategies offered by OpenShift Data Foundation fall into the following categories:
- Metro Disaster Recovery (Metro-DR): The Metro DR strategy is used in cases where the latency between primary and secondary datacenters is typically less than 10 ms, such as within a metropolitan area or between availability zones. The data is replicated using synchronous replication, which guarantees a zero RPO and near-zero RTO for seamless business continuity.

Metro DR using ODF
To maintain quorum in the event of an outage, an arbiter node with a storage monitor service must be set up at a third location for the Ceph Storage cluster. This third site, housing the arbiter, can be located at a distance, allowing for up to 100 ms RTT from the storage cluster connected to the OpenShift Container Platform instances.
- Regional Disaster Recovery (Regional-DR): The Regional DR configuration is intended for disaster recovery across greater geographical distances with higher latency.

Regional DR using ODF
Data is replicated using asynchronous replication, which results in a slightly higher RPO but maintains faster application performance by avoiding the overhead of synchronous latency, making it ideal for recovery between distant regions.
- Stretched Cluster DR: A stretched cluster solution can also be considered in cases where the latency between datacenters is less than 10 ms. However, in contrast to Metro DR, the stretched cluster is configured in a single OpenShift cluster. An arbiter node is used to maintain quorum for stretched cluster configurations.
Both the Metro and Regional DR solutions are comprehensive offerings that integrate key Red Hat technologies to provide robust application and data mobility across OpenShift Container Platform clusters.
ODF disaster recovery components
The core components required to implement Metro and Regional DR solutions include Red Hat Advanced Cluster Management for Kubernetes, Red Hat Ceph Storage, and the OpenShift Data Foundation. These components are orchestrated via the OpenShift DR operator. It is important to note, however, that accessing these specific disaster recovery capabilities within the ODF ecosystem requires an additional OpenShift Data Foundation Advanced subscription. The role of these components is briefly explained below:
- Red Hat Advanced Cluster Management for Kubernetes (RHACM): RHACM serves as the central control plane, offering a comprehensive management solution for multi-cluster OpenShift environments. It consists of a “Hub” cluster, which hosts the management components, and “Managed Clusters,” which are the OpenShift clusters under its control. RHACM is crucial for orchestrating DR policies and operations.
- Red Hat Ceph Storage: This is the robust, open, and massively scalable software-defined storage platform that underpins OpenShift Data Foundation. Ceph provides the distributed storage capabilities, combining stable Ceph storage with management tools and support services. It’s designed for petabyte-scale storage, significantly lowering the cost of managing enterprise data growth in cloud deployments.
- OpenShift Data Foundation (ODF): ODF itself provides the necessary interfaces and management layers to provision and manage storage for stateful applications within an OpenShift Container Platform cluster. It utilizes Red Hat Ceph Storage as its backend storage provider. Rook, a Kubernetes-native storage orchestrator, manages the lifecycle of Ceph within ODF. At the same time, the Ceph CSI driver enables the provisioning and management of persistent volumes for stateful applications.
- OpenShift DR Operator: This is the specialized disaster recovery orchestrator designed specifically for stateful applications across peer OpenShift clusters. Deployed and managed via RHACM, the OpenShift DR operator provides cloud-native interfaces to orchestrate the entire lifecycle of an application’s state on persistent volumes during failover and failback operations.
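For illustration, a Regional-DR policy managed by the OpenShift DR operator through RHACM might look roughly like the sketch below. The cluster names are hypothetical, and field names can differ between ODF releases, so treat this as an indicative shape rather than a definitive specification:

```yaml
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPolicy
metadata:
  name: odr-policy-5m                 # hypothetical policy name
spec:
  drClusters:                         # RHACM-managed clusters that form the DR pair
    - ocp-primary
    - ocp-secondary
  schedulingInterval: 5m              # asynchronous replication interval for Regional-DR
```

Applications are then associated with the policy (via placement resources on the RHACM hub), and the DR operator orchestrates volume replication, failover, and failback for the persistent volumes of those applications.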
Using Trilio for OpenShift disaster recovery
When discussing disaster recovery approaches for OpenShift, several third-party solutions are available, each offering a unique approach to data protection. A widely adopted option is Trilio, which stands out as a comprehensive platform offering application mobility and robust backup functionality.
Trilio is engineered as an enterprise-grade platform, explicitly addressing the multifaceted data protection and disaster recovery needs of modern organizations. It differentiates itself through a robust suite of advanced features. These include application-consistent backups that enable precise point-in-time recovery along with granular restores that provide detailed control over which data is restored.
Here are some of the key features offered by Trilio:
- Automated backup: Trilio enables the automated scheduling of point-in-time backups coupled with flexible recovery options. This ensures consistent data protection for applications, adhering strictly to defined policies and procedures.
- Continuous restore for efficient DR: Trilio’s continuous restore feature forms the bedrock of highly effective disaster recovery strategies. It guarantees rapid application restoration post-disaster, irrespective of the underlying cloud provider. Trilio’s innovative architecture takes this one step further, allowing you to have a single, dedicated DR cluster that acts as the constant restore target for multiple primary application clusters. This approach significantly reduces infrastructure costs compared to traditional one-to-one primary-to-DR cluster models.
- Cost-effective regional DR: Traditional regional disaster recovery for OpenShift often involves deploying two active clusters across distinct geographical regions, leading to complex and expensive application and network configurations. Trilio provides a more agile and cost-efficient alternative. By storing backups in geographically redundant storage, Trilio facilitates recovery to a secondary region without the constant overhead of an active-active setup.
- Migration and platform portability: A significant advantage of using Trilio is its robust transform capabilities, which deliver true platform portability and effectively eliminate vendor lock-in for OpenShift applications. This feature enables organizations to seamlessly migrate applications across diverse environments, including between different cloud platforms and on-premises clusters. For example, if cloud costs become prohibitive, customers can easily migrate applications back to their own infrastructure. Trilio’s ability to modify backups on the fly ensures business continuity and future-proof application deployments, empowering organizations to select the optimal platform as their needs and cost structures evolve.
- Multi-cluster management integration: Trilio’s integration with Red Hat Advanced Cluster Management for Kubernetes (RHACM) enhances its utility by enabling the definition and orchestration of policy-driven data protection. This unified management spans a wide array of Kubernetes deployments, encompassing hybrid, multi-cloud, and edge environments, centralizing control over data protection strategies.
- Full orchestration with Ansible: Trilio also fully supports Ansible for hands-off management. Using the Certified Ansible Role allows you to automate your backup workflows and keep them consistent. It’s a straightforward way to centralize your data protection strategy across all your Kubernetes deployments.
Recommendations
The following are some recommendations for implementing DR solutions for OpenShift environments:
- Define clear RTO and RPO objectives: Before choosing any DR solution, precisely define objectives such as RTO and RPO for each application in your environment. These metrics will dictate the most appropriate and cost-effective DR strategy.
- Embrace GitOps for configuration management: Utilize Git as the single source of truth for all OpenShift application manifests and configurations. Its declarative approach simplifies and accelerates the recovery of stateless components, ensuring consistency and minimizing manual errors during a DR event.
- Implement comprehensive data protection for stateful workloads: Recognize that GitOps alone cannot recover stateful workloads. Consider using solutions such as OpenShift Data Foundation, which provides support for both the Metro and Regional DR approaches.
- Backups: Think about using specialized tools, such as Trilio for OpenShift, to handle robust application-consistent backups, granular restores, and cross-cluster mobility for both stateless and stateful components. This ensures the integrity and availability of your critical application data in the event of a disaster.
- Automate failover and failback processes: Minimize human intervention during a disaster by automating failover and failback procedures wherever possible. Leverage solutions such as Red Hat Advanced Cluster Management (RHACM) and Ansible Automation Platform to orchestrate the entire DR process.
- Regularly test your DR plan: A DR plan is only as good as its last test. Conduct regular, realistic DR drills to validate the effectiveness of your chosen strategy, identify potential bottlenecks, and refine procedures. Test both failover and failback scenarios to ensure smooth transitions in both directions.
- Design for network connectivity across sites: Ensure robust and appropriate network connectivity between your primary and secondary DR sites. This includes sufficient bandwidth and latency management, especially for synchronous replication or distributed stateful workloads that require east-west communication paths.
Conclusion
Effective disaster recovery for OpenShift environments is paramount for ensuring continuous business operations in today’s cloud-native landscape. Although the risks to OpenShift are essentially the same as those of any other enterprise technology environment, disaster recovery strategies for an OpenShift environment differ fundamentally from traditional methods. With OpenShift, there are more dependencies due to the distributed nature of containers, which can make infrastructures vulnerable to single points of failure. Several solutions are available for this purpose. Ultimately, the optimal choice depends on an organization’s specific recovery time and recovery point objectives, budget, and existing infrastructure, requiring a tailored approach.