Reference Guide: Achieving High Availability in Kubernetes

High availability (HA) in Kubernetes is the difference between a production-ready cluster that maintains business operations and one susceptible to costly downtime. The impact of Kubernetes downtime extends beyond immediate service disruption: an outage in a healthcare system’s patient records application doesn’t just pause operations; it potentially impacts patient care decisions. Achieving true high availability requires more than redundant components. It demands careful planning across multiple layers, from control plane redundancy to application-level resilience, and each layer must address specific recovery objectives, data consistency requirements, and workload availability needs. While Kubernetes provides the building blocks for high availability, implementing them correctly requires understanding specific failure scenarios and recovery procedures, along with the architectural patterns and backup strategies that address them. Throughout this article, we’ll explore practical approaches to achieving and maintaining high availability in Kubernetes environments, with concrete examples and configurations you can apply to your infrastructure.

Summary of key Kubernetes high availability concepts

  • Planning high availability in Kubernetes: Analyze business needs and define HA requirements, including downtime limits, data loss thresholds, and availability targets.
  • Control plane HA architecture: Eliminate single points of failure by configuring redundant control plane components (API servers, schedulers, and controllers) and a distributed etcd cluster.
  • Worker node HA strategies: Create a resilient compute layer with node redundancy and automatic failover for seamless workload availability during node failures and maintenance.
  • Application-level HA: Design highly available applications using Kubernetes features like workload distribution, state management, and service discovery.
  • Data protection for HA: Implement robust data protection mechanisms to safeguard against data loss and ensure application recoverability in the event of failures.
  • Testing and validation for HA: Validate high availability through comprehensive testing, including chaos engineering, monitoring critical components, and multi-region failover testing.

Planning high availability in Kubernetes

Before diving into technical implementations, recognize that your Kubernetes HA strategy must align with specific business requirements. Start by identifying critical workloads and their availability requirements. For example, a system with 99.99% uptime (often called “four nines”) allows for only about 52.6 minutes of downtime per year, or roughly 4.4 minutes per month. This level of availability is crucial for customer-facing applications, where even brief outages directly impact revenue and customer trust. In contrast, 99.9% availability (“three nines”) allows for about 8.8 hours of downtime per year, or roughly 43.8 minutes per month. This level might be acceptable for internal tools like development environments or reporting systems, where brief interruptions don’t immediately impact customer experience or business revenue. This initial assessment forms the foundation for all subsequent architectural decisions.

Balancing Kubernetes Availability Requirements

Business availability targets

Availability targets directly influence your Kubernetes architecture. Here’s what these targets mean in practice: a four-nines availability requirement means your services must be accessible 99.99% of the time, allowing for no more than about 52.6 minutes of service unavailability per year. This differs from simple system uptime, as a system can run (up) but not serve requests properly (unavailable).

Your infrastructure must include comprehensive monitoring and alerting systems to detect and respond to potential failures before they impact service availability. Consider also the geographical distribution of your users: Applications serving a global user base might require multi-region deployments to meet latency and availability requirements.

Understanding RPO/RTO in a Kubernetes context

Recovery point objectives (RPOs) and recovery time objectives (RTOs) are important metrics that shape your backup and recovery strategy.

Your RPO defines your maximum acceptable data loss in case of failure. For instance, if your application requires zero data loss, you’ll need synchronous data replication. A 15-minute RPO allows for more flexible backup cycles.

RPO/RTO planning

RTO determines how quickly you must restore services after an incident. Applications requiring sub-minute recovery need hot standby configurations, while those tolerating longer downtimes can use cold standby options. 

These metrics must be defined at both the infrastructure and application levels, as different components may have varying recovery requirements.

Impact on cluster architecture decisions

Your availability requirements drive specific architectural choices in your Kubernetes deployment. A single master configuration might provide 99.5% availability, while a multi-master setup is necessary for 99.9% or higher. The choice between stacked and external etcd topology depends on your data consistency requirements. Your architecture must account for node distribution and load balancer configurations to maintain API server availability during zone failures. 

HA Architecture Components and Dependencies

The above diagram illustrates how specific business requirements translate into architectural decisions in a Kubernetes HA implementation. On the left, we see three core business requirements that drive HA design:

  • Uptime requirements (such as 99.99% availability)
  • Data loss tolerance (defined by RPO/RTO targets)
  • Recovery time needs (how quickly systems must recover)

These requirements directly influence the choice of architectural components shown on the right. For example:

  • Uptime requirements drive decisions about control plane redundancy and network resilience
  • Data loss tolerance determines the approach to storage and etcd cluster configuration
  • Recovery time needs influence both control plane and data layer design

 

Cost vs. availability trade-offs

Higher availability inherently increases operational costs through additional infrastructure components and operational complexity. Moving from 99.9% to 99.99% availability often requires doubling your infrastructure investment in redundant systems, cross-zone data replication, and comprehensive backup solutions. 

Balance these costs against the business impact of downtime. Consider not just infrastructure costs but also the operational overhead of maintaining complex HA configurations, and factor in the costs of training teams, maintaining documentation, and dealing with increased troubleshooting complexity.

Some organizations opt for a hybrid approach, implementing different availability levels for different application tiers based on their business criticality. For example:

  • Payment Processing: 99.99% availability (revenue-critical)
  • Product Catalog: 99.95% availability (revenue-impacting)
  • Admin Dashboard: 99.9% availability (internal tooling)
  • Development Environment: 99.5% availability (non-critical)

Control plane HA architecture

A production-grade Kubernetes control plane requires at least three master nodes distributed across different availability zones. This setup prevents both hardware failures and zone outages from disrupting cluster operations. 

The control plane consists of multiple components: the API server, which handles all API operations; the controller manager, which manages various controllers; the scheduler, which assigns pods to nodes; and etcd, which stores the cluster’s state.

etcd clustering setup

etcd is a distributed key-value store that holds all cluster state data, making it critical for cluster operations. It uses the Raft consensus algorithm to maintain consistency across multiple nodes. A three-node etcd cluster provides an optimal balance between reliability and write performance: it can tolerate one node failure while maintaining quorum. A five-node cluster tolerates two failures and offers higher availability, but the larger quorum increases write latency.

Here’s a practical configuration:

# Example etcd configuration
etcd:
  local:
    serverCertSANs:
      - "192.168.1.101"
    peerCertSANs:
      - "192.168.1.101"
  extraArgs:
    initial-cluster: "master1=https://192.168.1.101:2380,master2=https://192.168.1.102:2380,master3=https://192.168.1.103:2380"
    initial-cluster-state: new
    name: master1
    listen-peer-urls: https://192.168.1.101:2380
    listen-client-urls: https://192.168.1.101:2379
    advertise-client-urls: https://192.168.1.101:2379
    initial-advertise-peer-urls: https://192.168.1.101:2380

Multi-master configuration

The control plane components can run in either stacked or external etcd configurations. In a stacked configuration, etcd runs on the same nodes as other control plane components, simplifying management but potentially impacting performance. External etcd configurations provide better isolation and performance but require additional infrastructure.

The API server operates in active-active mode, where all instances serve traffic simultaneously. However, the controller manager and scheduler operate in active-standby mode to prevent conflicts in cluster management operations. Leader election ensures that only one instance is active at a time.

# Example kubeadm configuration
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  name: master1
  criSocket: "unix:///var/run/containerd/containerd.sock"
localAPIEndpoint:
  advertiseAddress: "192.168.1.101"
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: 1.26.0
controlPlaneEndpoint: "k8s-api.example.com:6443"
networking:
  podSubnet: "10.244.0.0/16"
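
Once a multi-node control plane like this is running, you can confirm which controller manager and scheduler instances currently hold leadership by inspecting the Lease objects they use for leader election (on recent Kubernetes versions):

# The HOLDER column shows which instance is currently active
kubectl -n kube-system get lease kube-controller-manager kube-scheduler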

Load balancer requirements

Load balancing for the Kubernetes API server requires special consideration because API traffic is TLS-encrypted TCP. The load balancer must support TCP passthrough rather than terminating TLS, preserving end-to-end encryption and client certificate authentication. Session affinity (sticky sessions) isn’t required because the API server is stateless, but connection draining during updates is important to prevent request failures.

# Example HAProxy configuration
frontend k8s-api
    bind *:6443
    mode tcp
    option tcplog
    default_backend k8s-api-backend

backend k8s-api-backend
    mode tcp
    balance roundrobin
    option tcp-check
    server master1 192.168.1.101:6443 check fall 3 rise 2
    server master2 192.168.1.102:6443 check fall 3 rise 2
    server master3 192.168.1.103:6443 check fall 3 rise 2

The diagram below illustrates the complete architecture of a highly available Kubernetes control plane deployment across three availability zones. This multi-layered architecture ensures continuous cluster operation even if individual components or entire zones fail.

Control plane HA

The image above depicts an architecture that provides multiple layers of redundancy:

  • No single point of failure in any layer
  • Automatic failover between availability zones
  • Consistent data storage through distributed etcd
  • Load-balanced access to all API servers

Worker node HA strategies

Worker node high availability forms the foundation of a resilient Kubernetes cluster. While the control plane manages the cluster state, worker nodes execute the actual workloads. A comprehensive worker node HA strategy must address both planned maintenance and unexpected failures while ensuring application availability.

Node pools and zone distribution

Node pools represent groups of worker nodes with similar configurations and purposes. In a high-availability setup, nodes should be distributed across multiple availability zones to protect against zone-level failures. This distribution isn’t just about spreading nodes—it’s about maintaining enough capacity in each zone to handle workload redistribution during failures.

The key here is capacity planning. Each zone should maintain enough spare capacity to handle the redistribution of workloads from a failed zone. For example, if you’re running at 66% capacity across three zones, losing one zone would allow the remaining two zones to absorb the redistributed workloads without oversubscription.

# Example node pool configuration in cloud provider
apiVersion: compute.cloud.example/v1
kind: NodePool
metadata:
  name: production-workload
spec:
  replicas: 6  # 2 nodes per zone
  distribution:
    zones: 
      - zone-1
      - zone-2
      - zone-3
  nodeConfig:
    machineType: e2-standard-4
    labels:
      pool: production
      environment: prod

Pod disruption budgets

Pod disruption budgets (PDBs) are Kubernetes’ way of ensuring application availability during voluntary disruptions. The concept revolves around maintaining a minimum number of healthy pods during operations like node drains or cluster upgrades.

PDBs work in conjunction with the Kubernetes scheduler and respect pod topology spread constraints. They’re particularly important for stateful applications where pod placement can impact performance and reliability. The PDB mechanism helps maintain application SLAs by preventing too many pods from being unavailable simultaneously.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: 2    # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: critical-service

Recovery strategies during node failures

Node failure recovery involves multiple mechanisms working together. The process begins with failure detection, moves through workload evacuation, and ends with node replacement, if necessary. This process must be automated to minimize service disruption.

The node lifecycle in failure scenarios follows several stages:

  1. Detection: The node problem detector identifies an issue.
  2. Isolation: Kubernetes marks the node as unschedulable.
  3. Evacuation: Pods are rescheduled to healthy nodes.
  4. Recovery: Automated repair or replacement occurs.
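
The evacuation stage can also be tuned per workload. By default, Kubernetes gives pods a 300-second toleration for not-ready and unreachable nodes before evicting them; the hedged fragment below (values are illustrative, placed under a Deployment’s or StatefulSet’s pod spec) shortens that window for a latency-sensitive service:

# Illustrative tolerations overriding the default 300-second eviction delay
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60   # evict 60 seconds after the node becomes not-ready
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 60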

Each layer of the worker node HA strategy builds upon the others to create a resilient infrastructure capable of handling various failure scenarios while maintaining application availability. Regular testing of these mechanisms, particularly through chaos engineering practices, helps validate the effectiveness of your HA implementation.

Application-level high availability

HA at the application level involves a complex interplay among infrastructure, application design, and operational patterns. Unlike infrastructure HA, which focuses on platform resilience, application HA demands a deep understanding of workload characteristics, data consistency, and business continuity needs. The fundamental goal is to maintain service availability despite component failures, network issues, or planned maintenance.

Stateless vs. stateful considerations

The distinction between stateless and stateful applications forms the base of HA strategy. Stateless applications follow the principles of immutable infrastructure: They can be created, destroyed, or replaced without data loss concerns. Each request contains all the information needed to process it, making horizontal scaling and failover straightforward.
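
As a minimal sketch (all names are placeholders), a stateless service typically runs as a Deployment whose replicas can be scaled or replaced freely, with a rolling update strategy that preserves capacity during changes:

# Example stateless Deployment (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count
      maxSurge: 1         # add one extra pod while rolling out a new version
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: web
        image: nginx:1.27
        ports:
        - containerPort: 80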

Stateful applications, however, maintain data that must persist across sessions and survive pod restarts. This persistence requirement introduces complexity in several areas:

  • Data synchronization between replicas
  • Consistent ordering of operations
  • Backup and recovery procedures

Pod anti-affinity rules

Anti-affinity represents a critical concept in Kubernetes scheduling logic. It enforces constraints that prevent pods from colocating on the same infrastructure components, thereby improving fault tolerance. There are two types of anti-affinity:

  • Required anti-affinity: Hard rules that must be satisfied
  • Preferred anti-affinity: Soft rules that the scheduler attempts to satisfy

The scheduler evaluates these rules against topology keys: labels that represent failure domains like nodes, racks, or availability zones. This mechanism ensures proper workload distribution across your infrastructure’s failure domains.
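
As a hedged illustration, a preferred (soft) anti-affinity rule placed under a pod template’s spec might look like the fragment below, asking the scheduler to spread replicas across zones without blocking scheduling when that isn’t possible (the label and weight are placeholders):

# Illustrative soft anti-affinity across availability zones
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web-frontend
        topologyKey: topology.kubernetes.io/zone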

Deploying an HA database

Let’s examine a complete example of deploying a highly available PostgreSQL database on Kubernetes using AWS infrastructure. 

The StatefulSet configuration below demonstrates several critical HA concepts:

  • Running multiple replicas across availability zones
  • Configuring proper health checks
  • Setting up streaming replication
  • Managing persistent storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
 name: postgres
spec:
 # Headless service for stable DNS names for each pod
 serviceName: postgres
 # Three replicas: one primary and two replicas spread across AZs
 replicas: 3
 selector:
   matchLabels:
     app: postgres
 template:
   metadata:
     labels:
       app: postgres
   spec:
     # Used for AWS IAM roles to access AWS services like KMS, S3 etc.
     serviceAccountName: postgres-sa  
     # Required for postgres user permissions on mounted volumes
     securityContext:
       fsGroup: 999
       
     # Init container runs before main container to handle primary/replica setup
     initContainers:
     - name: init-postgres
       image: postgres:16
       command:
       - /bin/bash
       - -c
       - |
         # postgres-0 becomes primary, others become replicas
         if [[ $(hostname) == postgres-0 ]]; then
           echo "Initializing primary node"
         else
           echo "Initializing replica node"
           # Create replica using streaming replication
           pg_basebackup -h postgres-0.postgres -D /var/lib/postgresql/data/pgdata -U replicator -v -P --wal-method=stream
         fi
       env:
       - name: PGPASSWORD
         valueFrom:
           secretKeyRef:
             name: postgres-secret
             key: replication-password
       volumeMounts:
       - name: postgres-data
         mountPath: /var/lib/postgresql/data

     containers:
     - name: postgres
       image: postgres:16
       env:
       - name: POSTGRES_PASSWORD
         valueFrom:
           secretKeyRef:
             name: postgres-secret
             key: password
       # Specify PGDATA path for PostgreSQL data directory
       - name: PGDATA
         value: /var/lib/postgresql/data/pgdata
       ports:
       - containerPort: 5432
       
       # Resource requests and limits suitable for production workload
       resources:
         requests:
           cpu: "2"
           memory: "8Gi"
         limits:
           cpu: "4"
           memory: "16Gi"
           
       # Health checks for container lifecycle management
       readinessProbe:
         exec:
           command: ["pg_isready", "-U", "postgres", "-h", "localhost"]
         initialDelaySeconds: 5
         periodSeconds: 10
       livenessProbe:
         exec:
           command: ["pg_isready", "-U", "postgres", "-h", "localhost"]
         initialDelaySeconds: 30
         periodSeconds: 10
         
       # Volume mounts for data and configuration
       volumeMounts:
       - name: postgres-data
         mountPath: /var/lib/postgresql/data
       - name: postgres-config
         mountPath: /etc/postgresql/postgresql.conf
         subPath: postgresql.conf

     # Ensure pods are distributed across AZs
     topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: postgres

     # Prevent multiple pods on the same node
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
         - labelSelector:
             matchLabels:
               app: postgres
           topologyKey: kubernetes.io/hostname
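      # NOTE: the postgres-config volumeMount above assumes a pod-level "volumes:"
      # entry backed by a ConfigMap (for example, one named postgres-config that
      # carries postgresql.conf), defined alongside the containers in this pod spec.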

 # EBS volume configuration for each pod
 volumeClaimTemplates:
 - metadata:
     name: postgres-data
   spec:
     accessModes: [ "ReadWriteOnce" ]
     # Using our custom storage class for EBS gp3
     storageClassName: ebs-gp3-postgres
     resources:
       requests:
         storage: 100Gi
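
The serviceName field above assumes a matching headless Service, which gives each pod the stable DNS name (for example, postgres-0.postgres) that the init container’s pg_basebackup command relies on. A minimal sketch of that Service:

# Headless Service providing stable per-pod DNS names for the StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None   # headless: no virtual IP, DNS resolves to individual pods
  selector:
    app: postgres
  ports:
  - name: postgres
    port: 5432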

Data protection for HA

Data protection in Kubernetes extends beyond simple backup and restore operations. It requires a comprehensive strategy that ensures data consistency, minimal RPO/RTO, and seamless integration with application lifecycle management. While Kubernetes provides basic volume snapshots, enterprise environments demand more sophisticated solutions for true high availability.
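
For reference, a basic CSI volume snapshot of the PostgreSQL example’s data volume might look like the sketch below; it assumes a CSI driver with snapshot support and a VolumeSnapshotClass (the class name here is a placeholder). Note that it captures a single volume in a crash-consistent state only and knows nothing about the Secrets, ConfigMaps, or replicas that make up the rest of the application.

# Point-in-time snapshot of one PVC (crash-consistent, volume-level only)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder; depends on your CSI driver
  source:
    persistentVolumeClaimName: postgres-data-postgres-0   # PVC created by the StatefulSet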

Backup strategies

Traditional backup methods often prove inadequate in Kubernetes environments due to the inherent complexity of container ephemerality and state distribution. Applications running on Kubernetes typically span multiple components, making it essential to maintain consistency across these distributed elements during backup operations. Cross-namespace dependencies and resource relationships further complicate the backup process, requiring sophisticated orchestration to ensure complete and consistent captures.

For example, Trilio’s approach addresses these challenges through application-aware backups. This method considers the entire application stack as a single unit, ensuring that all components are captured in a consistent state. The system intelligently handles dependencies between different parts of the application, maintaining referential integrity throughout the backup process.

Implementing automated backups

Automated backup implementation requires consideration of scheduling patterns and resource impact. Full backups provide complete state capture but consume significant resources, while incremental backups offer efficiency at the cost of slightly more complex recovery procedures. The key lies in balancing these approaches based on application activity patterns and business requirements.
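
As a hedged sketch of the full-backup side of such a schedule, the CronJob below takes a nightly logical dump of the PostgreSQL example to a dedicated backup volume (the schedule, PVC name, and database details are illustrative; incremental approaches would typically rely on WAL archiving or a purpose-built backup tool instead):

# Illustrative nightly full backup of the PostgreSQL primary
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-full-backup
spec:
  schedule: "0 2 * * *"        # 02:00 daily, during low-activity hours
  concurrencyPolicy: Forbid    # never run overlapping backups
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:16
            command: ["/bin/bash", "-c"]
            args:
            - pg_dump -h postgres-0.postgres -U postgres -F c -f /backups/full-$(date +%Y%m%d).dump postgres
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            volumeMounts:
            - name: backup-storage
              mountPath: /backups
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: postgres-backups   # placeholder PVC for backup storage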

Recovery time optimization

Recovery time optimization involves balancing speed with reliability. Parallel restore operations can significantly reduce recovery time but must be managed carefully to prevent resource contention.

The recovery process must be thoroughly tested and validated. This includes verifying application health post-recovery, ensuring data consistency, and validating service connectivity. Performance validation ensures that the recovered application meets its operational requirements. Regular recovery testing helps identify potential issues before they impact real recovery scenarios.

Testing and validation

Testing high availability isn’t just about technical validation—it’s about ensuring business continuity. Real-world production environments rarely fail in expected ways, and the testing approach must reflect this reality.

Chaos engineering practices

Start small and build confidence gradually. Begin with testing in non-peak hours and in isolated environments. For example, instead of immediately failing an entire region, start by testing how your applications handle a single pod failure, then a node failure, and then gradually increase the scope. Document every unexpected behavior—these are your most valuable insights.
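
Even before adopting a dedicated chaos engineering tool, a manual experiment in a staging cluster can be revealing. A hedged sketch, assuming a recent kubectl version and placeholder names:

# Delete one pod of a service and watch how quickly a replacement becomes Ready
kubectl -n staging delete "$(kubectl -n staging get pods -l app=web-frontend -o name | head -n 1)"
kubectl -n staging get pods -l app=web-frontend -w

# Simulate planned node maintenance, then restore the node afterwards
kubectl drain worker-node-1 --ignore-daemonsets --delete-emptydir-data
kubectl uncordon worker-node-1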

Monitoring critical components

Focus on monitoring what matters to your business, not just what’s easy to measure. For example, if you’re running an ecommerce platform, monitoring pod CPU usage is important, but tracking successful checkout completions is critical. Build your monitoring strategy around your customer’s journey.

A practical approach used by successful organizations is to create monitoring dashboards that business stakeholders can understand. While technical metrics are important, being able to show the direct impact on business operations makes conversations about reliability investments much more productive.

Key metrics and alert rules

The most effective alerts are those that demand action. Many organizations suffer from alert fatigue because they alert on everything they can measure. Instead, focus on symptoms, not causes. For example, don’t alert on high CPU usage—alert on degraded customer response times.

Here’s an example of real-world prioritization:

  • Critical: Customer-impacting issues (service outages, failed transactions)
  • High: Issues that could become customer-impacting (degraded performance, reduced redundancy)
  • Medium: System health indicators (resource usage trending up)
  • Low: Informational (successful automated recoveries)
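
To make the “symptoms, not causes” guidance concrete, here is a hedged sketch of a latency-based alert using the Prometheus Operator’s PrometheusRule resource; the metric name, threshold, and labels are illustrative and depend entirely on your instrumentation:

# Illustrative symptom-based alert: customer-facing latency, not CPU
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-latency
  labels:
    release: prometheus   # must match your Prometheus Operator's rule selector
spec:
  groups:
  - name: checkout.rules
    rules:
    - alert: CheckoutLatencyDegraded
      expr: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{job="checkout"}[5m])) by (le)
        ) > 2
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "95th percentile checkout latency has exceeded 2s for 5 minutes"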

Multi-region failover testing

The hardest part of multi-region failover isn’t the technical implementation—it’s the human element. Regular failover testing often reveals that team members aren’t familiar with the procedures, documentation is outdated, or critical access credentials have expired.

Run semi-annual “game days” where teams practice failover procedures. Include not just the technical team but also customer support, business stakeholders, and communication teams. The goal isn’t just to test the technology but to practice the entire organizational response.

Conclusion

High availability in Kubernetes is ultimately about balancing technical capabilities with operational reality. Throughout this article, we’ve explored the comprehensive approach needed to achieve true high availability in Kubernetes environments.

We began by pointing out that HA requires careful planning across multiple layers—from business requirements to technical implementation. The control plane architecture showed us how to build a resilient foundation with multi-master setups and proper etcd clustering. Worker node strategies revealed the importance of proper distribution and automatic recovery mechanisms, while application-level HA demonstrated the crucial differences between stateless and stateful workloads.

As you move forward with your Kubernetes HA implementation:

  • Start with a thorough assessment of your current environment and business requirements. Map your applications’ criticality to specific availability targets and recovery objectives.
  • Build incrementally: Begin with control plane HA, then expand to worker nodes, and finally implement application-level resilience. Test thoroughly at each stage.
  • Remember that high availability is a journey, not a destination. Regular testing, validation, and refinement of your HA strategy ensure that it evolves with your business needs.
  • Consider investing in specialized tools like Trilio for data protection and hot standby, as native Kubernetes capabilities might not meet enterprise requirements for backup and recovery.
  • Most importantly, build operational readiness. The most sophisticated HA setup is only as good as your team’s ability to maintain and recover it during incidents.