When OADP Is Not Enough: Why Enterprises Turn to Trilio

Amazon Elastic Kubernetes Service simplifies cluster operations through AWS’s managed control plane, but disaster recovery planning remains the operator’s responsibility. AWS Backup for EKS provides managed EBS snapshot scheduling, but application-aware recovery isn’t covered out of the box. Advanced scenarios, like capturing and restoring Kubernetes resources, coordinating distributed stateful workloads, and orchestrating cross-region failover, require additional tooling.

Most organizations layer Velero or commercial solutions on top of AWS Backup for comprehensive DR capabilities. For some workloads, Velero’s community-supported functionality combined with manual procedures suffices. For others, cross-region IAM complexity, strict RTO requirements under one hour, or the need for guaranteed incident support drives the adoption of enterprise solutions with AWS Marketplace integration.

This article examines five best practices for building resilient EKS environments, starting with baseline protection and progressing through cross-region failover, application consistency, testing rigor, and solution selection aligned with operational maturity.

Summary of key EKS disaster recovery best practices

  • Establish baseline DR protection: Deploy Velero with the AWS plugin for EBS snapshot management, configure IAM roles for cross-account access, and validate basic restore operations within regions. Native AWS Backup for EKS handles infrastructure snapshots effectively but lacks application awareness, Kubernetes resource coordination, and orchestration for cross-region restores.
  • Address cross-region failover complexity: Velero enables backup replication to S3 buckets in alternate AWS regions but requires manual resource modification during restore. Organizations must choose between documented manual procedures (acceptable for infrequent DR) and automated restore transforms (necessary for regular testing and strict RTO requirements under an hour).
  • Orchestrate application-consistent backups: Velero hooks provide per-pod consistency commands, but production environments require coordinated execution across distributed applications. Implement dependency-aware backup workflows that ensure databases complete quiescing before dependent services snapshot EBS volumes and capture Kubernetes resources.
  • Test DR procedures regularly and measure RTO: Schedule quarterly DR tests using isolated VPCs and separate AWS accounts to validate backup integrity and measure actual recovery times. Test results often reveal gaps between theoretical RTO estimates and real-world restore duration, including EKS cluster provisioning, VPC peering setup, IAM policy updates, and S3 data transfer delays.
  • Choose solutions aligned with operational maturity: Match DR tooling to team expertise, on-call capabilities, and acceptable risk during production incidents. Community-supported Velero and vendor-supported solutions with AWS Marketplace integration represent different risk/cost tradeoffs that organizations should evaluate based on the business impact of extended outages.


Establish baseline DR protection with AWS-native and Kubernetes tools

AWS provides two primary approaches for EKS backup: AWS Backup for EKS and Velero with the AWS plugin. AWS Backup operates at the infrastructure layer, scheduling EBS snapshot creation through native AWS APIs. The service integrates with AWS Organizations for centralized policy management and provides backup vaults with encryption at rest. However, this infrastructure-focused approach stores Kubernetes resources in a proprietary format, limiting flexibility during restore operations. Granular restoration of individual namespaces, cross-cluster transformations like StorageClass modifications, and selective resource recovery become operationally complex compared to solutions designed specifically for Kubernetes-native workflows.
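To make the infrastructure-layer workflow concrete, here is a minimal sketch of scheduling EBS snapshots through AWS Backup. The plan name, vault name, role ARN, and tag key are placeholders; match the tag condition to whatever tags your volume provisioner actually applies:

# Create a daily backup plan for EBS volumes (names and ARNs are placeholders)
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "eks-ebs-daily",
  "Rules": [{
    "RuleName": "daily-snapshots",
    "TargetBackupVaultName": "eks-backup-vault",
    "ScheduleExpression": "cron(0 3 * * ? *)",
    "Lifecycle": {"DeleteAfterDays": 30}
  }]
}'

# Assign volumes to the plan by tag, using the plan ID returned above
aws backup create-backup-selection --backup-plan-id <plan-id> \
  --backup-selection '{
    "SelectionName": "eks-volumes",
    "IamRoleArn": "arn:aws:iam::111111111111:role/aws-backup-service-role",
    "ListOfTags": [{
      "ConditionType": "STRINGEQUALS",
      "ConditionKey": "kubernetes.io/cluster/prod-cluster",
      "ConditionValue": "owned"
    }]
  }'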

The diagram below breaks down both approaches side by side, detailing the IAM permissions each requires and the four operational stages any baseline DR implementation must work through. Understanding these requirements upfront helps you anticipate the configuration effort involved before committing to either approach.

AWS Backup for EKS vs. Velero with AWS Plugin: key capabilities, limitations, IAM requirements, and baseline DR workflow

Velero addresses the gap described above by understanding both Kubernetes resources and AWS storage primitives. Modern Velero deployments leverage the EBS CSI Driver with VolumeSnapshot CRDs for application-aware backups governed directly by the Kubernetes API. When backing up a namespace, Velero captures Deployment manifests, Service definitions, and Ingress configurations, along with persistent volume data via CSI snapshots. The backup process serializes Kubernetes API objects to JSON and stores them in S3, while creating VolumeSnapshot resources that the EBS CSI Driver translates into native AWS EBS snapshots.
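With the CSI approach, Velero discovers which VolumeSnapshotClass to use for the EBS driver via a well-known label. A minimal sketch (the class name is arbitrary):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-velero-snapclass
  labels:
    # Velero’s CSI integration selects the class carrying this label for each driver
    velero.io/csi-volumesnapshot-class: "true"
driver: ebs.csi.aws.com
# Retain keeps the underlying EBS snapshot even if the VolumeSnapshot object is deleted
deletionPolicy: Retain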

Basic restore testing within the same region validates your backup configuration without cross-region complexity:

# Create backup for production namespace
velero backup create prod-backup \
  --include-namespaces production \
  --storage-location aws-backup-s3

# Validate backup completion
velero backup describe prod-backup

# Test restore into an isolated namespace via namespace mapping
velero restore create prod-test-restore \
  --from-backup prod-backup \
  --namespace-mappings production:production-dr-test

This workflow confirms that IAM permissions work correctly, EBS snapshots complete successfully, and Kubernetes resources restore with proper relationships intact. The namespace mapping prevents test restores from overwriting production workloads.

While AWS Backup and Velero provide foundational protection, production environments often expose operational gaps during cross-region failover or multi-application recovery scenarios. Manual IAM configuration across accounts, complex cross-region resource transformations, and a lack of dependency-aware backup orchestration create an operational burden during incidents. 

Enterprise solutions, such as Trilio, available through AWS Marketplace, address these gaps with automated restore transforms, preconfigured IAM policies, and coordinated backup workflows that respect application dependencies. These capabilities are particularly valuable for organizations that require sub-hour RTOs or manage dozens of stateful applications across multiple regions.

Address cross-region failover operational complexity

AWS regions provide geographic separation for disaster recovery, with production clusters typically pairing us-east-1 with us-west-2 or eu-west-1 with eu-central-1. Velero supports backup replication through S3 cross-region replication, providing a foundation for regional failover. The operational challenge emerges during restore operations when region-specific infrastructure differences require manual intervention.
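A replication setup along the following lines (bucket names, role ARN, and location name are placeholders) keeps backup objects flowing to the DR region; note that S3 replication requires versioning on both buckets:

# Versioning is a prerequisite for replication (repeat for the destination bucket)
aws s3api put-bucket-versioning --bucket velero-backups-us-east-1 \
  --versioning-configuration Status=Enabled

# Replicate all backup objects to the DR-region bucket
aws s3api put-bucket-replication --bucket velero-backups-us-east-1 \
  --replication-configuration '{
    "Role": "arn:aws:iam::111111111111:role/s3-replication-role",
    "Rules": [{
      "ID": "velero-dr",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {"Bucket": "arn:aws:s3:::velero-backups-us-west-2"}
    }]
  }'

# Register the replica bucket as an additional Velero backup storage location
velero backup-location create dr-us-west-2 --provider aws \
  --bucket velero-backups-us-west-2 --config region=us-west-2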

Cross-region failover resource transformations between us-east-1 (production) and us-west-2 (DR) with four manual configuration changes

In this example, a production EKS cluster in us-east-1 references infrastructure that doesn’t exist in us-west-2. StorageClass definitions use gp3 volumes optimized for us-east-1 pricing and performance characteristics. IAM role ARNs embed the production account ID arn:aws:iam::111111111111:role/app-service. Application Load Balancer Ingress annotations reference a security group, sg-0abc123, that exists only in the us-east-1 VPC. Subnet IDs for node placement, subnet-12345a and subnet-12345b, correspond to us-east-1 availability zones.
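These references typically live in resource annotations. A sketch using the values above (the IRSA and AWS Load Balancer Controller annotation keys are standard; resource names are illustrative):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service
  annotations:
    # Embeds the production account and role; must be rewritten for the DR region
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/app-service
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    # This security group exists only in the us-east-1 VPC
    alb.ingress.kubernetes.io/security-groups: sg-0abc123
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80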

The manual cross-region restore workflow involves multiple steps where human error can extend recovery time. This procedure assumes a skilled operator executing each step correctly under production incident pressure. In practice, typos in security group IDs, incorrect IAM role ARN formatting, or missed ConfigMap updates extend recovery time by 30-60 minutes per error.

Manual cross-region EKS restore workflow: eight sequential steps totalling 33–58 minutes under ideal conditions, where each error in the four error-prone stages adds 30–60 minutes to recovery

Pre-built DR clusters reduce RTO by eliminating the time required to provision clusters from your recovery window. Imagine that an EKS cluster in us-west-2 sits idle, costing approximately $73/month for the control plane plus minimal node group capacity to maintain cluster health. When disaster strikes, you restore application workloads to this existing cluster rather than waiting for the cluster to be created. The trade-off appears on your AWS bill: You either pay for standby infrastructure or accept longer recovery times.

Automated restore transforms eliminate manual modification work by applying resource changes programmatically during restore operations. Solutions that understand both Kubernetes semantics and AWS infrastructure can:

  • Detect StorageClass references and replace them based on the target cluster context
  • Identify IAM role ARNs in service account annotations and substitute region-appropriate values
  • Update security group IDs by looking up equivalent groups in the target VPC based on tags
  • Modify subnet selections for node groups based on availability zone mappings
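
Velero itself provides a building block for the first of these transformations: resource modifiers (available in Velero 1.12 and later) apply JSON patches during restore. A minimal sketch for the StorageClass case, with a hypothetical DR-region class name:

# resourcemodifier.yaml: rewrite PVC storage classes during restore
version: v1
resourceModifierRules:
- conditions:
    groupResource: persistentvolumeclaims
    namespaces:
    - production
  patches:
  - operation: replace
    path: "/spec/storageClassName"
    value: "gp3-us-west-2"

The rules are loaded into a ConfigMap in the Velero namespace and referenced at restore time:

kubectl create configmap resource-modifier --namespace velero \
  --from-file=resourcemodifier.yaml
velero restore create --from-backup prod-backup \
  --resource-modifier-configmap resource-modifier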

The automation decision hinges on your RTO requirements and testing frequency. Organizations that require recovery within 60 minutes and test quarterly will find manual procedures operationally risky: too many steps invite errors under pressure. Organizations that tolerate 4-hour RTOs and test annually may find documented procedures sufficient.

Orchestrate application-consistent backups across dependencies

Velero hooks enable application-specific consistency commands before and after backup operations. Let’s review an example with a PostgreSQL StatefulSet, which might include annotations that coordinate database checkpoint operations with snapshot timing.

apiVersion: v1
kind: Pod
metadata:
  name: postgres-0
  annotations:
    pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "PGPASSWORD=$POSTGRES_PASSWORD psql -U postgres -c \"SELECT pg_start_backup(''backup'');\""]'
    post.hook.backup.velero.io/command: '["/bin/bash", "-c", "PGPASSWORD=$POSTGRES_PASSWORD psql -U postgres -c \"SELECT pg_stop_backup();\""]'
    pre.hook.backup.velero.io/timeout: 3m

The pre-hook flushes write-ahead logs to disk and creates a consistent checkpoint. Velero waits for successful completion (with a 3-minute timeout) before triggering EBS snapshots. The post-hook releases the backup mode, allowing normal database operations to resume. This per-pod approach maintains application-level consistency but doesn’t coordinate service dependencies.

Production environments run distributed applications where backup order matters. Consider an e-commerce platform with PostgreSQL for order data, Redis for session caching, and three payment processing microservices. Backing up the application tier while database transactions remain in flight creates restored applications that reference uncommitted data. The cache might hold invalidation keys for records that don’t exist in the restored database state.

Dependency-aware workflows coordinate backup execution across service boundaries, as shown below.

Application-consistent backup workflow with five dependency-aware stages

Implementing this orchestration manually with Velero requires developing a custom controller that watches backup resources and coordinates execution based on annotation-based dependencies. Organizations must build logic to parse dependency declarations, manage execution ordering, handle hook failures gracefully, and retry operations when transient issues occur. This development effort diverts engineering resources from core platform work while introducing custom code that requires ongoing maintenance as Kubernetes and Velero versions evolve.
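To make the idea concrete, such a controller might key its ordering off annotations like the following. This scheme is entirely hypothetical; neither Kubernetes nor Velero interprets these keys:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  annotations:
    # Hypothetical keys a custom ordering controller would parse
    backup.example.com/depends-on: "postgres,redis"  # snapshot only after these complete
    backup.example.com/tier: "application"
spec:
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
      - name: payment-service
        image: example.com/payment-service:1.0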

Application-aware backup policies address these coordination challenges by treating multi-component applications as atomic units rather than collections of independent pods. These policies understand application topology—which services depend on which databases, which microservices share state through caches, and which Helm charts deploy interdependent resources. During backup, the policy orchestrates hooks across all application components in the correct sequence, validates that each phase completes successfully before proceeding, and provides centralized monitoring for the entire application backup rather than tracking individual pod-level operations. Solutions that offer this capability eliminate the need for custom controller development while providing consistent backup workflows that teams can test, validate, and rely on during actual recovery scenarios.

Test disaster recovery procedures regularly and measure RTO

Theoretical RTO estimates assume perfect execution of documented procedures under ideal conditions. Reality introduces delays: EKS cluster provisioning takes longer than AWS documentation suggests, S3 cross-region data transfer saturates available bandwidth, or manual IAM configuration steps require multiple attempts due to policy syntax errors or incorrect trust relationships. Regular DR testing reveals these gaps between theory and practice. 

A good test workflow validates multiple aspects simultaneously.

Backup integrity verification confirms that backups contain viable data and complete resource definitions. Restored databases should start successfully, application pods should initialize without CrashLoopBackOff errors, and services should achieve a ready state. If restored applications fail to start, the backup configuration may be excluding critical ConfigMaps or Secrets, or EBS snapshots may capture an inconsistent state due to missing pre-backup hooks.
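A few commands after each test restore catch most of these failure modes; this sketch assumes the namespace mapping from the earlier restore example (production-dr-test):

# Inspect restore status, warnings, and per-resource errors
velero restore describe prod-test-restore --details

# List any pods that failed to reach Running (CrashLoopBackOff shows up here)
kubectl get pods -n production-dr-test --field-selector=status.phase!=Running

# Fail fast if deployments never become available
kubectl wait --for=condition=available deployment --all \
  -n production-dr-test --timeout=300s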

Actual RTO measurement provides data for capacity planning and incident response expectations. Time each phase separately, as shown in the example below.

DR testing estimated vs. actual RTO across five typical recovery phases

This example reveals a 2.3× difference between the theoretical estimate and the actual measured recovery time. The S3 data transfer took longer than anticipated due to concurrent backups consuming bandwidth, and pod initialization delays stemmed from container image pulls from ECR in a different region. Manual configuration errors required multiple attempts to correctly configure IAM role trust policies.
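A lightweight wrapper script captures these per-phase timings during test runs; the phase commands below are placeholders for your own provisioning and restore steps:

#!/usr/bin/env bash
# Log the wall-clock duration of each DR phase for RTO tracking.
phase() {
  local name="$1"; shift
  local start
  start=$(date +%s)
  "$@"
  echo "${name}: $(( $(date +%s) - start ))s" | tee -a rto-measurements.log
}

phase "cluster-provisioning" eksctl create cluster -f dr-cluster.yaml
phase "velero-restore" velero restore create --from-backup prod-backup --wait
phase "workload-ready" kubectl wait --for=condition=available \
  deployment --all -n production --timeout=30m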

Cross-region configuration validation surfaces AWS-specific modifications required during restore. Test restores to alternate regions expose StorageClass incompatibilities (gp3 not available in older regions), IAM role ARN formatting errors, or Route53 health check configuration gaps. Discovering these issues during quarterly tests rather than during actual us-east-1 outages significantly reduces incident duration.

Non-disruptive testing uses VPC isolation and security groups to prevent test workloads from impacting production:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dr-test-isolation
  namespace: dr-test-2024-q4
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: dr-test-2024-q4

This NetworkPolicy restricts test-namespace pods to communicating only within the test namespace, preventing accidental connections to production RDS instances or external payment-processing APIs. The empty ingress rules block all inbound traffic, ensuring that test workloads remain completely isolated.

Update DR documentation based on test findings. If a test revealed that IAM role assumption required specific ExternalID parameters not documented in runbooks, add explicit examples with correct syntax. If security group references caused LoadBalancer creation failures, document the mapping between production and DR security group IDs with verification commands. Documentation improvements compound over multiple test cycles, reducing human error during actual incidents when cognitive load peaks.

Choose DR solutions aligned with operational maturity and risk

DR tooling selection balances capabilities against operational complexity and risk tolerance during production incidents. Velero provides robust backup and restore functionality as a community-supported CNCF project. Organizations with deep Kubernetes and AWS expertise, flexible RTO requirements, and tolerance for community-supported models may find Velero sufficient. Others require guaranteed support during regional AWS outages or automation that eliminates manual steps during high-stress incident response.

The support model difference becomes particularly relevant during unexpected outages. Imagine a us-east-1 regional outage at 2 AM that affects both production and your on-call engineer’s ability to access documentation. When your production database restore fails with cryptic EBS snapshot errors or S3 permission denied messages, community support means opening a GitHub issue and waiting for community maintainers to investigate. There’s no guaranteed response time, no escalation path for production outages affecting revenue, and no direct access to engineers familiar with your specific IAM configuration or VPC topology.

The diagram maps these differences across seven concrete decision factors, showing where the two approaches diverge based on your environment’s actual requirements rather than general preferences. RTO requirements and cross-region failover complexity are often the deciding factors—if sub-hour recovery and automated failover are non-negotiable, the community model’s manual workflows and variable response times introduce unacceptable risk.

EKS DR solution selection framework

Enterprise solutions like Trilio provide vendor support with committed SLAs and AWS-specific expertise:

  • 24/7 support with response times based on severity level (15-minute response for production-down incidents)
  • Root cause analysis for failed restore operations, including IAM policy debugging
  • Configuration review before DR tests to identify potential cross-region issues
  • Direct engineering escalation for production-blocking problems during regional outages
  • Hotfixes and patches without waiting for community release cycles or dependency updates

Organizations must evaluate the cost of this support against the potential revenue impact of extended outages. As an example, an e-commerce platform generating $50,000 in hourly revenue during peak shopping seasons loses an additional $300,000 (six extra hours at $50,000 per hour) if DR recovery time stretches from 2 hours to 8 hours due to troubleshooting delays without vendor support.

Be sure to match tooling complexity to team expertise and on-call capabilities. A platform team with deep Kubernetes knowledge, extensive AWS IAM experience, and a proven track record of troubleshooting Velero operations may operate effectively with community support. On the other hand, teams with general cloud operations backgrounds face different challenges. If they manage EKS alongside other responsibilities and support several dozen applications, they might benefit from vendor support that accelerates incident resolution and reduces the troubleshooting burden during outages. 

Conclusion

Building resilient EKS environments requires matching disaster recovery capabilities to application requirements, operational maturity, and risk tolerance. AWS Backup for EKS provides infrastructure-level snapshot management, while Velero adds Kubernetes resource awareness and application hooks that effectively handle many DR scenarios.

Production environments often encounter operational complexity during cross-region failover—IAM role transformations, security group remapping, StorageClass conversions—that extend beyond the capabilities of community-supported tooling. Regular quarterly testing reveals these gaps early, providing actual RTO measurements rather than theoretical estimates and enabling informed decisions about automation investment and vendor support based on operational experience rather than marketing claims.
