When Backup for GKE Is Not Enough: Why Enterprises Turn to Trilio

GKE clusters present a layered data protection challenge that goes well beyond backing up virtual machines. Production workloads span namespaces, persistent volumes provisioned across different storage classes, Helm-managed applications, and custom operators, all of which need to be captured together for a restore to be meaningful. At the same time, Google’s managed control plane handles much of the infrastructure, which creates a false sense of security: The cluster nodes may be resilient, but the application state and persistent data are the customer’s responsibility.

This article examines the core strategies and tools for protecting GKE workloads, from native Google options to third-party platforms, and covers the operational decisions that engineers need to make when designing a production-grade backup approach.

Summary of key GKE backup concepts

Concept | Description
What GKE backup does and doesn’t protect | GKE separates infrastructure resilience from application data protection. Node pools, cluster config, and external services like Cloud SQL fall outside the standard backup scope and require separate strategies.
Native Backup for GKE capabilities and limitations | Google’s Backup for GKE handles Kubernetes manifests and Persistent Disk volume snapshots, but excludes Filestore NFS volumes, container images, and state held in external services like Cloud SQL or Memorystore.
Application-centric vs. volume-centric backup | Capturing PV snapshots while a database is actively writing produces an inconsistent state. Application-centric approaches capture manifests, metadata, and data together using pre-backup hooks to guarantee consistent recovery.
Backup scope: namespaces, labels, Helm, and operators | Backup granularity (namespace-wide, label-selected, or Helm/operator-scoped) directly determines whether a restore produces a working application or a collection of objects with no functional context.
Retention policies and incremental backup efficiency | Forever-incremental strategies using QCOW2 image chains avoid repeated full captures and reduce storage costs. Synthetic full backups merge incremental chains at retention time, keeping restore chains valid without re-capturing from the cluster.
Cross-cluster and multi-region recovery | Restoring to a different GKE cluster or region introduces storage class mismatches, namespace conflicts, and network policy differences that must be resolved before a restore can succeed.

What GKE backup actually covers

The most important thing to understand about GKE’s managed infrastructure is what it does not protect. Google’s control plane manages node availability, scales node pools, and handles API server uptime, but none of that extends to your application state. If a developer runs kubectl delete namespace production or a misconfigured deployment overwrites a PVC, the cluster’s built-in resilience is irrelevant—that data is gone unless you backed it up separately. Several categories of resources fall completely outside the standard backup scope:
  • External service state: Cloud SQL databases, Memorystore instances, and Filestore NFS volumes are not part of any cluster-level backup. Each requires its own protection strategy.
  • Cluster configuration: Node pool definitions, autoscaling settings, network policies, and enabled cluster features are not captured by application-layer backup tools.
  • Container images: Neither native GKE tooling nor most third-party platforms back up images from Artifact Registry or external registries.
The practical takeaway is that the GKE backup strategy requires thinking in layers. Cluster infrastructure is Google’s responsibility; everything running inside the cluster, including the external services your applications depend on, is yours.
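As a hedged sketch of that layered thinking, external service state can be protected with service-level gcloud backups alongside the cluster backup. The orders-db and shared-nfs instance names below are illustrative, and exact flags should be checked against current gcloud documentation:

```shell
# On-demand Cloud SQL backup (instance name is illustrative)
gcloud sql backups create --instance=orders-db

# Filestore backup for an NFS share mounted by cluster workloads
gcloud filestore backups create shared-nfs-backup-001 \
  --instance=shared-nfs \
  --file-share=vol1 \
  --instance-zone=us-central1-a \
  --region=us-central1
```

Scheduling these alongside the cluster-level BackupPlan keeps the external state layer from becoming the gap that a restore drill discovers too late.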

Native Backup for GKE: capabilities, configuration, and limitations

Google’s Backup for GKE is a managed add-on that captures Kubernetes resource manifests and creates snapshots of Persistent Disk volumes. It operates through two custom resource types: BackupPlan, which defines what to back up and how often, and RestorePlan, which provides a reusable restore template pointing at a target cluster. Enabling the add-on and creating a basic plan takes only a few commands:
# Enable the Backup for GKE add-on on an existing cluster
gcloud container clusters update my-cluster \
  --update-addons=BackupRestore=ENABLED \
  --region=us-central1

# Create a BackupPlan targeting a specific namespace
gcloud beta container backup-restore backup-plans create my-backup-plan \
  --project=my-project \
  --location=us-central1 \
  --cluster=projects/my-project/locations/us-central1/clusters/my-cluster \
  --selected-namespaces=production \
  --include-volume-data \
  --backup-retain-days=14

# Trigger a manual backup
gcloud beta container backup-restore backups create my-backup-001 \
  --project=my-project \
  --location=us-central1 \
  --backup-plan=my-backup-plan
The add-on works well for straightforward use cases, specifically namespace-scoped backups of workloads whose volumes are provisioned by the pd.csi.storage.gke.io CSI driver. When your requirements go further, the limitations become relevant:
  • Only volumes backed by the Persistent Disk CSI driver are captured. Filestore NFS volumes, Google Cloud NetApp Volumes, and other storage backends are excluded entirely.
  • Cluster-level configuration (node pools, network policies, enabled features, etc.) is not part of the backup.
  • Container images are not captured.
  • Cross-project restore recently reached general availability, which is a meaningful improvement for teams storing backups in isolated projects. Cross-region restore still has latency and configuration implications worth testing before relying on it for DR.
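Restores in Backup for GKE follow the same declarative pattern as backups, using a RestorePlan as a reusable template. A hedged sketch for the plan created above (names are illustrative; flag names and required flags should be verified against current gcloud documentation):

```shell
# Reusable restore template pointing at a target cluster
gcloud beta container backup-restore restore-plans create my-restore-plan \
  --project=my-project \
  --location=us-central1 \
  --backup-plan=projects/my-project/locations/us-central1/backupPlans/my-backup-plan \
  --cluster=projects/my-project/locations/us-central1/clusters/my-cluster \
  --namespaced-resource-restore-mode=fail-on-conflict \
  --volume-data-restore-policy=restore-volume-data-from-backup \
  --selected-namespaces=production
```

The conflict mode and volume policy choices matter: fail-on-conflict aborts rather than overwriting live resources, which is usually the safer default for restores into a shared cluster.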
For teams running standard stateless or lightly stateful workloads entirely on Persistent Disk, the native add-on often covers the basic requirements. The gaps become significant when applications are managed by operators, use Helm, or depend on storage backends other than Persistent Disk.

For teams whose requirements exceed what the native add-on covers, Trilio for Kubernetes (T4K) takes a different architectural approach. Rather than operating as an external agent, T4K is built entirely on Kubernetes-native primitives: its API layer is a set of custom resource definitions, and its state lives in etcd alongside the rest of the cluster. The control plane manages backup, restore, and retention logic through dedicated controllers, while the data plane spins up short-lived datamover pods to transfer volume data to the backup target. Those pods exist only for the duration of a backup or restore operation, then terminate.

Trilio for Kubernetes component architecture (source)

Application-centric backup: why manifest and volume snapshots aren’t enough

A PV snapshot is a point-in-time copy of the underlying disk. The problem is that “point in time” doesn’t account for what was in memory when the snapshot ran. A PostgreSQL instance in the middle of a write transaction, a Kafka broker with unflushed segments, or a MongoDB replica with in-flight journal entries can all produce snapshots that restore to a corrupted or inconsistent state. The solution is pre-backup hooks: commands injected into application containers before the CSI snapshot triggers. A typical database hook flushes in-memory writes and sets the application to a read-only or quiesced state, holds that state while the snapshot runs, then releases it. The sequence looks like this:
  1. Pre-backup hook runs inside the database container (e.g., pg_start_backup() for PostgreSQL, renamed pg_backup_start() in PostgreSQL 15).
  2. CSI snapshot is taken of the PVC.
  3. Post-backup hook releases the quiesced state.
  4. Backup metadata and manifests are written to the target.
Trilio handles this through its Hooks CRD, which lets you define hook sequences declaratively and attach them to a BackupPlan. Velero supports hooks via annotations on pods, which requires every workload to be annotated correctly: an operational overhead that grows with the number of stateful applications.

The second problem with manifest-only backup is that Helm releases and operator-managed workloads carry state beyond what a plain manifest export captures. A Helm release has its own metadata stored as Kubernetes secrets in the namespace. An operator manages CRDs and sub-resources that a plain kubectl get all won’t surface. Restoring the PVCs without that context produces data without a functional application on top of it. Trilio automatically discovers CRDs and associated sub-resources when a Helm release or operator instance is selected as the backup unit. Velero and native GKE tooling require manual resource filtering and annotation overhead to achieve the same scope.
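As a concrete illustration of the annotation-based approach Velero uses, its standard hook annotations can be attached to a database pod like this. The pod name, container name, and data path are illustrative; fsfreeze is shown as one common quiesce mechanism, not the only one:

```shell
# Freeze the filesystem before the snapshot (Velero pre-backup hook)
kubectl annotate pod postgres-0 \
  pre.hook.backup.velero.io/container=postgres \
  pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/lib/postgresql/data"]'

# Thaw it after the snapshot completes (post-backup hook)
kubectl annotate pod postgres-0 \
  post.hook.backup.velero.io/container=postgres \
  post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/var/lib/postgresql/data"]'
```

Every stateful pod needs this treatment individually, which is the annotation overhead the paragraph above describes.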

Defining backup scope: namespaces, labels, Helm, and operators

Backup scope determines the boundary around what gets captured as a unit. The choice matters because it directly determines what you can restore independently. The four main approaches each have a different fit.
Scope type | Granularity | Best fit | Operational risk
Full cluster | All namespaces | DR baseline, cluster migration | Large, slow, not useful for per-app recovery
Namespace | Everything in the namespace | Self-contained single-team apps | Breaks for multi-namespace apps; captures unrelated workloads
Label-based | Resources matching label selectors | Logical app groupings across namespaces | Requires disciplined, consistent labeling practices
Helm / operator | Entire release or operator instance | Microservices, stateful operator apps | Requires backup tool to understand Helm/operator metadata
Namespace-level backup is operationally straightforward, but it breaks in two common scenarios. First, applications that span multiple namespaces (a common pattern when shared infrastructure namespaces host services like ingress controllers or cert-manager) won’t restore completely from a single namespace backup. Second, a namespace containing multiple unrelated applications means a restore operation affects everything in that namespace, not just the target workload.

Label-based selection solves the multi-namespace problem, but it depends on consistent labeling discipline across the team. Labels applied ad hoc, or that drift as new resources are added, produce incomplete backups without any obvious error.

Treating a Helm release or operator instance as the atomic backup unit maps most naturally to how engineers think about applications. When you back up the payment-service Helm release, you get every Kubernetes resource that the release manages (deployments, services, configmaps, secrets, PVCs), plus the release metadata stored in the namespace. Trilio’s Helm support and operator backup capability handle this automatically at plan creation time without requiring you to enumerate individual resource types.
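Label-based scope is only as good as the labels, so a quick audit before relying on a selector helps catch resources that drifted out of it. The label key, value, and namespace below are illustrative:

```shell
# List everything carrying the app label across all namespaces
kubectl get deploy,sts,svc,cm,secret,pvc -A \
  -l app.kubernetes.io/part-of=payment-service

# Find PVCs in the app's namespace that are missing the label entirely
kubectl get pvc -n payments \
  -l '!app.kubernetes.io/part-of'
```

Anything surfaced by the second query would be silently excluded from a label-scoped backup, which is exactly the failure mode described above.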

Retention policies and incremental backup efficiency

Running daily full backups of multi-terabyte persistent volumes is neither practical nor cost-effective. Forever-incremental backup solves this by taking a single full capture and then storing only the changed blocks in each subsequent backup. Trilio uses the open QCOW2 image format for this: each incremental backup is an overlay file that references the prior base image and records only the blocks that changed. A 1 TB PV with 10 GB of actual writes between backup windows produces a roughly 10 GB backup image rather than a full 1 TB capture.

The diagram below shows a five-backup chain where the retention policy is set to three. Backups 1 and 2 fall outside the retention window. Rather than deleting them and breaking the chain, Trilio commits the overlay files into the oldest retained backup (backup_to_retain), producing a new base image at the storage target. The remaining overlays reference that new base, the old chain is freed, and incremental backups continue without interruption.

QCOW2 backup chain before retention is applied, showing backup_to_retain and the overlay files queued for merging (source)

The challenge with long incremental chains is retention. You can’t simply delete old overlay files because each one is referenced by everything built on top of it. Deleting a two-week-old incremental would break every more-recent backup that depends on it. Trilio solves this with synthetic full backups: when an incremental falls outside the retention window, Trilio merges the chain into a new base image at the storage target. The source cluster is never re-read; the merge happens entirely within the backup storage. All remaining incremental overlays now reference the new base, and the old chain is freed. A retention policy in Trilio is defined declaratively as a Policy custom resource referenced from the BackupPlan:
apiVersion: triliovault.trilio.io/v1
kind: Policy
metadata:
  name: production-retention
  namespace: tvk
spec:
  type: Retention
  retentionConfig:
    latest: 2
    weekly: 1
    monthly: 2
    yearly: 1
This policy keeps the two most recent backups regardless of age, plus one weekly, two monthly, and one yearly restore point. Retention schedules should match workload write patterns: high-write databases warrant shorter backup intervals and longer retention windows, while for read-heavy or batch workloads, longer intervals and fewer restore points reduce storage costs without meaningfully increasing recovery risk.
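Because the format is open QCOW2, the retention-time merge described above can be illustrated with standard qemu-img tooling. This is a sketch of the mechanic, not Trilio’s internal implementation, and the file names are illustrative:

```shell
# Inspect the chain: each overlay records only blocks changed since its base
qemu-img info --backing-chain backup_3.qcow2

# Commit the expiring overlay's blocks down into its base image,
# producing a synthetic full without re-reading the source cluster
qemu-img commit backup_1.qcow2

# Repoint the next overlay at the merged base so the chain stays valid
qemu-img rebase -u -b base.qcow2 -F qcow2 backup_2.qcow2
```

The key property is that every step operates on files at the storage target; the cluster whose volumes were backed up is never involved.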

Cross-cluster and multi-region recovery

Restoring into a different GKE cluster or a different region is rarely as simple as pointing a restore operation at a new target. Storage class names differ between clusters, namespace structures may not match, and network policies and resource quotas configured in the source cluster will conflict if applied verbatim in the destination. A restore that doesn’t account for these differences will fail partway through or produce a cluster in a partially configured state.

Trilio’s Transformation feature handles this at restore time rather than requiring you to edit YAML manually before each restore. Transformations are defined as mapping rules in the Restore CR: you specify that the standard storage class in the source should be rewritten as premium-rwo in the destination, or that a specific annotation should be stripped from all resources before they’re applied. The Trilio docs on restoring backups cover the full set of supported transform operations, including storage class remapping, namespace label rewriting, and custom resource modifications.

True disaster recovery scenarios introduce an additional requirement: backup artifacts must be fully self-contained, with no dependency on a live source cluster. Trilio’s Target Browser feature addresses this directly. Backup artifacts stored on S3 or NFS can be browsed and restored from any cluster where Trilio is installed. The source cluster doesn’t need to be reachable, and no metadata from the original backup operation needs to exist on the destination. This is what makes the difference between a backup that works in normal operations and one that works when the source cluster is actually gone.

One underappreciated use case for Transformation is cross-platform migration. Because Trilio is certified across GKE, EKS, OpenShift, Rancher RKE2, and upstream Kubernetes, a backup taken on an RKE2 cluster can be restored into GKE, or a GKE workload can be migrated to OpenShift.
Transformations handle the distribution-specific differences in storage classes, namespace labels, and resource annotations that would otherwise block the restore. This is particularly useful for teams evaluating platform changes or consolidating workloads across environments.

Finally, restore drills run only against the production cluster don’t validate the actual recovery path: they confirm that backup artifacts exist and can be read, but not that a working application comes up in an isolated environment. Periodic tests into a separate GKE project or an isolated namespace, run against backups from production, are the only way to confirm that your backup strategy actually delivers recovery when you need it.

Trilio’s Continuous Restore feature extends this further by pre-staging PV data on one or more remote clusters as each new backup completes. The topology is flexible: any cluster can act as both a source and a restore site, provided all participating clusters share access to the same backup target. The diagram below shows a five-cluster setup where TVK1, TVK2, and TVK5 share Bucket1, allowing any of them to serve as a restore destination for applications backed up on TVK1.

Multi-cluster Continuous Restore topology showing shared backup targets across five Trilio instances (source)
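For periodic drills with the native tooling, one hedged pattern (the project, plan, and backup names are illustrative, and cross-project restore requires appropriate IAM access) is to restore into an isolated validation project and then verify that workloads actually become available rather than merely present:

```shell
# Restore a production backup into an isolated drill project
gcloud beta container backup-restore restores create drill-restore-001 \
  --project=dr-validation-project \
  --location=us-central1 \
  --restore-plan=drill-restore-plan \
  --backup=projects/my-project/locations/us-central1/backupPlans/my-backup-plan/backups/my-backup-001

# Confirm the application comes up, not just that objects were created
kubectl wait deployment -n production --all \
  --for=condition=Available --timeout=10m
```

The kubectl wait step is the part most drills skip; a restore that creates objects but never reaches Available has not validated the recovery path.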

Conclusion

GKE’s managed infrastructure handles the parts of cluster operations that used to require dedicated platform teams: node availability, control plane uptime, and infrastructure scaling. What it doesn’t handle is protecting the application state that lives inside the cluster. That responsibility sits with the engineering team, and the right backup architecture depends on getting several decisions right simultaneously.

Native Backup for GKE covers namespace-scoped workloads backed by Persistent Disk, with the cross-project restore capability now generally available. For teams with more complex requirements, including operator-managed applications, Helm releases, multi-namespace apps, or workloads requiring cross-platform portability, a purpose-built platform like Trilio fills the gaps that native tooling leaves open. Application-consistent hooks, automatic operator and Helm discovery, QCOW2-based incremental efficiency, and declarative restore transforms together address the failure modes that a manifest-plus-snapshot approach doesn’t account for.
