GKE clusters present a layered data protection challenge that goes well beyond backing up virtual machines. Production workloads span namespaces, persistent volumes provisioned across different storage classes, Helm-managed applications, and custom operators, all of which need to be captured together for a restore to be meaningful. At the same time, Google’s managed control plane handles much of the infrastructure, which creates a false sense of security: The cluster nodes may be resilient, but the application state and persistent data are the customer’s responsibility.
This article examines the core strategies and tools for protecting GKE workloads, from native Google options to third-party platforms, and covers the operational decisions that engineers need to make when designing a production-grade backup approach.
Summary of key GKE backup concepts
| Concept | Description |
|---|---|
| What GKE backup does and doesn’t protect | GKE separates infrastructure resilience from application data protection. Node pools, cluster config, and external services like Cloud SQL fall outside the standard backup scope and require separate strategies. |
| Native Backup for GKE capabilities and limitations | Google’s Backup for GKE handles Kubernetes manifests and Persistent Disk volume snapshots, but excludes Filestore NFS volumes, container images, and state held in external services like Cloud SQL or Memorystore. |
| Application-centric vs. volume-centric backup | Capturing PV snapshots while a database is actively writing produces an inconsistent state. Application-centric approaches capture manifests, metadata, and data together using pre-backup hooks to guarantee consistent recovery. |
| Backup scope: namespaces, labels, Helm, and operators | Backup granularity (namespace-wide, label-selected, or Helm/operator-scoped) directly determines whether a restore produces a working application or a collection of objects with no functional context. |
| Retention policies and incremental backup efficiency | Forever-incremental strategies using QCOW2 image chains avoid repeated full captures and reduce storage costs. Synthetic full backups merge incremental chains at retention time, keeping restore chains valid without re-capturing from the cluster. |
| Cross-cluster and multi-region recovery | Restoring to a different GKE cluster or region introduces storage class mismatches, namespace conflicts, and network policy differences that must be resolved before a restore can succeed. |
What GKE backup actually covers
The most important thing to understand about GKE's managed infrastructure is what it does not protect. Google's control plane manages node availability, scales node pools, and handles API server uptime, but none of that extends to your application state. If a developer runs kubectl delete namespace production or a misconfigured deployment overwrites a PVC, the cluster's built-in resilience is irrelevant: that data is gone unless you backed it up separately. Several categories of resources fall completely outside the standard backup scope:
- External service state: Cloud SQL databases, Memorystore instances, and Filestore NFS volumes are not part of any cluster-level backup. Each requires its own protection strategy (a sketch follows this list).
- Cluster configuration: Node pool definitions, autoscaling settings, network policies, and enabled cluster features are not captured by application-layer backup tools.
- Container images: Neither native GKE tooling nor most third-party platforms back up images from Artifact Registry or external registries.
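Because external services hold state outside the cluster, they need their own backup schedule. One common pattern is to drive that schedule from the cluster itself. Below is a minimal sketch of a CronJob that triggers an on-demand Cloud SQL backup with the gcloud CLI, assuming Workload Identity is configured for the hypothetical cloud-sql-backup service account; all names are placeholders.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cloudsql-nightly-backup
  namespace: ops
spec:
  schedule: "0 2 * * *"            # 02:00 UTC, before the GKE backup window
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cloud-sql-backup   # bound to a GCP service account via Workload Identity
          restartPolicy: OnFailure
          containers:
            - name: gcloud
              image: gcr.io/google.com/cloudsdktool/cloud-sdk:slim
              command:
                - gcloud
                - sql
                - backups
                - create
                - --instance=orders-db           # hypothetical Cloud SQL instance
                - --project=my-project
```

Cloud SQL also supports automated backups natively; the point is that whichever mechanism you choose, it is scheduled and monitored separately from whatever protects the cluster.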
Native Backup for GKE: capabilities, configuration, and limitations
Google's Backup for GKE is a managed add-on that captures Kubernetes resource manifests and creates snapshots of Persistent Disk volumes. It operates through two resource types: BackupPlan, which defines what to back up and how often, and RestorePlan, which provides a reusable restore template pointing at a target cluster. Enabling the add-on and creating a basic plan takes only a few commands:

```shell
# Enable the Backup for GKE add-on on an existing cluster
gcloud container clusters update my-cluster \
  --update-addons=BackupRestore=ENABLED \
  --region=us-central1

# Create a BackupPlan targeting a specific namespace
gcloud beta container backup-restore backup-plans create my-backup-plan \
  --project=my-project \
  --location=us-central1 \
  --cluster=projects/my-project/locations/us-central1/clusters/my-cluster \
  --selected-namespaces=production \
  --include-volume-data \
  --backup-retain-days=14

# Trigger a manual backup
gcloud beta container backup-restore backups create my-backup-001 \
  --project=my-project \
  --location=us-central1 \
  --backup-plan=my-backup-plan
```

The add-on works well for straightforward use cases, specifically namespace-scoped backups of workloads using pd.csi.storage.gke.io provisioned volumes. When your requirements go further, the limitations become relevant:
- Only volumes backed by the Persistent Disk CSI driver are captured. Filestore NFS volumes, Google Cloud NetApp Volumes, and other storage backends are excluded entirely (see the snapshot sketch after this list).
- Cluster-level configuration (node pools, network policies, enabled features, etc.) is not part of the backup.
- Container images are not captured.
- Cross-project restore recently reached general availability, which is a meaningful improvement for teams storing backups in isolated projects. Cross-region restore still has latency and configuration implications worth testing before relying on it for DR.
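The volume coverage boundary follows the snapshot primitive underneath: Backup for GKE captures what the Persistent Disk CSI driver can snapshot. The sketch below shows that primitive directly, a VolumeSnapshotClass for pd.csi.storage.gke.io and a VolumeSnapshot against a hypothetical PVC; volumes provisioned by other drivers need a different snapshot or backup path entirely, and the resource names here are illustrative.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: pd-snapshots
driver: pd.csi.storage.gke.io        # only volumes on this driver fall inside Backup for GKE's volume scope
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: orders-db-snap
  namespace: production
spec:
  volumeSnapshotClassName: pd-snapshots
  source:
    persistentVolumeClaimName: orders-db-data   # hypothetical PVC backed by Persistent Disk
```

Taking a manual VolumeSnapshot like this is also a quick way to confirm which of your PVCs are actually snapshot-capable before relying on --include-volume-data.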
Trilio for Kubernetes component architecture (source)
Application-centric backup: why manifest and volume snapshots aren’t enough
A PV snapshot is a point-in-time copy of the underlying disk. The problem is that "point in time" doesn't account for what was in memory when the snapshot ran. A PostgreSQL instance in the middle of a write transaction, a Kafka broker with unflushed segments, or a MongoDB replica with in-flight journal entries can all produce snapshots that restore to a corrupted or inconsistent state. The solution is pre-backup hooks: commands injected into application containers before the CSI snapshot triggers. A typical database hook flushes in-memory writes and sets the application to a read-only or quiesced state, holds that state while the snapshot runs, then releases it. The sequence looks like this:
1. The pre-backup hook runs inside the database container (e.g., pg_start_backup() for PostgreSQL).
2. The CSI snapshot is taken of the PVC.
3. The post-backup hook releases the quiesced state.
4. Backup metadata and manifests are written to the target.
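Each tool defines hooks in its own way; Velero, mentioned later in this article, expresses them as pod annotations, which makes for a compact illustration of the sequence above. Below is a minimal sketch for a PostgreSQL pod, assuming a PostgreSQL 14-or-earlier image where pg_start_backup()/pg_stop_backup() still exist; names and credentials handling are simplified.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: orders-postgres-0
  namespace: production
  annotations:
    # Quiesce PostgreSQL before the snapshot (pre-PostgreSQL-15 function names).
    pre.hook.backup.velero.io/container: postgres
    pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -U postgres -c \"SELECT pg_start_backup(''app-backup'', true);\""]'
    # Release the quiesced state once the snapshot has been taken.
    post.hook.backup.velero.io/container: postgres
    post.hook.backup.velero.io/command: '["/bin/bash", "-c", "psql -U postgres -c \"SELECT pg_stop_backup();\""]'
spec:
  containers:
    - name: postgres
      image: postgres:14
      ports:
        - containerPort: 5432
```

Exclusive backup mode was removed in PostgreSQL 15, where the equivalents are pg_backup_start() and pg_backup_stop(); the point here is the shape of the hook, not the specific SQL.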
Hooks solve consistency, but they don't solve completeness. Operator-managed and Helm-managed applications carry context in CRDs, custom resources, and release metadata that kubectl get all won't surface. Restoring the PVCs without that context produces data without a functional application on top of it. Trilio automatically discovers CRDs and associated sub-resources when a Helm release or operator instance is selected as the backup unit. Velero and native GKE tooling require manual resource filtering and annotation overhead to achieve the same scope.

Defining backup scope: namespaces, labels, Helm, and operators
Backup scope determines the boundary around what gets captured as a unit. The choice matters because it directly determines what you can restore independently. The four main approaches each have a different fit.

| Scope type | Granularity | Best fit | Operational risk |
|---|---|---|---|
| Full cluster | All namespaces | DR baseline, cluster migration | Large, slow, not useful for per-app recovery |
| Namespace | Everything in the namespace | Self-contained single-team apps | Breaks for multi-namespace apps; captures unrelated workloads |
| Label-based | Resources matching label selectors | Logical app groupings across namespaces | Requires disciplined, consistent labeling practices |
| Helm / operator | Entire release or operator instance | Microservices, stateful operator apps | Requires backup tool to understand Helm/operator metadata |
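Of these, label-based scoping is only as reliable as the labeling itself: every resource that belongs to the logical application has to carry the selector, or the backup silently misses it. The standard app.kubernetes.io labels are one workable convention, sketched here on a hypothetical deployment and its PVC; the selector and names are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: payments
  labels:
    app.kubernetes.io/name: payment-api
    app.kubernetes.io/part-of: payment-service   # the label a label-scoped backup selects on
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: payment-api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: payment-api
        app.kubernetes.io/part-of: payment-service
    spec:
      containers:
        - name: api
          image: example.com/payment-api:1.4.2    # placeholder image
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: payment-ledger
  namespace: payments
  labels:
    app.kubernetes.io/part-of: payment-service    # without this label, the PVC falls outside the backup
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
```

A backup scoped to app.kubernetes.io/part-of=payment-service then picks up both objects, even across namespaces, provided the convention is enforced (an admission policy or CI lint helps).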
When you select the payment-service Helm release as the backup unit, you get every Kubernetes resource that the release manages (deployments, services, configmaps, secrets, PVCs), plus the release metadata stored in the namespace. Trilio's Helm support and operator backup capability handle this automatically at plan creation time without requiring you to enumerate individual resource types.
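As a rough illustration of what that looks like in practice, the sketch below selects a Helm release as the backup unit in a Trilio BackupPlan. The exact field names are an assumption drawn from the TVK BackupPlan schema and should be checked against Trilio's documentation; the target and policy names are placeholders.

```yaml
apiVersion: triliovault.trilio.io/v1
kind: BackupPlan
metadata:
  name: payment-service-plan
  namespace: payments
spec:
  backupConfig:
    target:
      name: shared-s3-target        # Trilio Target CR (placeholder name)
      namespace: tvk
    retentionPolicy:
      name: production-retention    # retention Policy, defined later in this article
      namespace: tvk
  backupPlanComponents:             # field name assumed from the TVK API
    helmReleases:
      - payment-service             # back up everything the release manages
```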
Retention policies and incremental backup efficiency
Running daily full backups of multi-terabyte persistent volumes is neither practical nor cost-effective. Forever-incremental backup solves this by taking a single full capture and then storing only the changed blocks in each subsequent backup. Trilio uses the open QCOW2 image format for this. Each incremental backup is an overlay file that references the prior base image and records only the blocks that changed. A 1 TB PV with 10 GB of actual writes between backup windows produces a roughly 10 GB backup image rather than a full 1 TB capture.

The challenge with long incremental chains is retention. You can't simply delete old overlay files because each one is referenced by everything built on top of it; deleting a two-week-old incremental would break every more recent backup that depends on it. Trilio solves this with synthetic full backups: when an incremental falls outside the retention window, Trilio merges the chain into a new base image at the storage target. The source cluster is never re-read; the merge happens entirely within the backup storage. All remaining incremental overlays then reference the new base, and the old chain is freed.

The diagram below shows a five-backup chain where the retention policy is set to three. Backups 1 and 2 fall outside the retention window, so their overlay files are committed into the oldest retained backup (backup_to_retain), producing a new base image at the storage target. The remaining overlays reference that new base, and incremental backups continue without interruption.

QCOW2 backup chain before retention is applied, showing backup_to_retain and the overlay files queued for merging (source)

A retention policy in Trilio is defined declaratively as a Policy custom resource referenced by the BackupPlan:

```yaml
apiVersion: triliovault.trilio.io/v1
kind: Policy
metadata:
  name: production-retention
  namespace: tvk
spec:
  type: Retention
  retentionConfig:
    latest: 2
    weekly: 1
    monthly: 2
    yearly: 1
```

This policy keeps the two most recent backups regardless of age, plus one weekly, two monthly, and one yearly restore point. Retention schedules should match workload write patterns: high-write databases warrant shorter backup intervals and longer retention windows than read-heavy or batch workloads, where longer intervals and fewer restore points reduce storage costs without meaningfully increasing recovery risk.

Cross-cluster and multi-region recovery
Restoring into a different GKE cluster or a different region is rarely as simple as pointing a restore operation at a new target. Storage class names differ between clusters, namespace structures may not match, and network policies and resource quotas configured in the source cluster will conflict if applied verbatim in the destination. A restore that doesn't account for these differences will fail partway through or produce a cluster in a partially configured state.

Trilio's Transformation feature handles this at restore time rather than requiring you to edit YAML manually before each restore. Transformations are defined as mapping rules in the Restore CR. You specify that the standard storage class in the source should be rewritten as premium-rwo in the destination, or that a specific annotation should be stripped from all resources before they're applied. The Trilio docs on restoring backups cover the full set of supported transform operations, including storage class remapping, namespace label rewriting, and custom resource modifications.

True disaster recovery scenarios introduce an additional requirement: Backup artifacts must be fully self-contained, with no dependency on a live source cluster. Trilio's Target Browser feature addresses this directly. Backup artifacts stored on S3 or NFS can be browsed and restored from any cluster where Trilio is installed. The source cluster doesn't need to be reachable, and no metadata from the original backup operation needs to exist on the destination. This is what makes the difference between a backup that works in normal operations and one that works when the source cluster is actually gone.

One underappreciated use case for Transformation is cross-platform migration. Because Trilio is certified across GKE, EKS, OpenShift, Rancher RKE2, and upstream Kubernetes, a backup taken on an RKE2 cluster can be restored into GKE, or a GKE workload can be migrated to OpenShift. Transformations handle the distribution-specific differences in storage classes, namespace labels, and resource annotations that would otherwise block the restore. This is particularly useful for teams evaluating platform changes or consolidating workloads across environments.

Finally, restore drills run only against the production cluster don't validate the actual recovery path. They confirm that backup artifacts exist and can be read, but they don't confirm that a working application comes up in an isolated environment. Periodic tests into a separate GKE project or an isolated namespace, run against backups from production, are the only way to confirm that your backup strategy actually delivers recovery when you need it. Trilio's Continuous Restore feature extends this further by pre-staging PV data on one or more remote clusters as each new backup completes. The topology is flexible: Any cluster can act as both a source and a restore site, provided all participating clusters share access to the same backup target. The diagram below shows a five-cluster setup where TVK1, TVK2, and TVK5 share Bucket1, allowing any of them to serve as a restore destination for applications backed up on TVK1.

Multi-cluster Continuous Restore topology showing shared backup targets across five Trilio instances (source)
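The shared backup target in that topology is an ordinary Trilio Target custom resource created on each participating cluster and pointed at the same bucket. A minimal sketch follows; the field names are an assumption based on the TVK Target schema and should be checked against Trilio's documentation, and the bucket, region, and secret names are placeholders.

```yaml
apiVersion: triliovault.trilio.io/v1
kind: Target
metadata:
  name: shared-s3-target
  namespace: tvk
spec:
  type: ObjectStore                  # S3-compatible object storage (field value assumed)
  vendor: AWS                        # assumption: vendor naming per TVK docs
  objectStoreCredentials:
    region: us-east-1
    bucketName: gke-backup-artifacts # same bucket on every cluster that should be able to restore
    credentialSecret:
      name: s3-credentials           # Secret holding the access key pair
      namespace: tvk
  thresholdCapacity: 1Ti
```

Because every cluster that applies an equivalent Target sees the same artifacts, the Target Browser on any of them can list and restore backups taken elsewhere, which is exactly the property the disaster recovery scenario above depends on.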
Conclusion
GKE’s managed infrastructure handles the parts of cluster operations that used to require dedicated platform teams: node availability, control plane uptime, and infrastructure scaling. What it doesn’t handle is protecting the application state that lives inside the cluster. That responsibility sits with the engineering team, and the right backup architecture depends on getting several decisions right simultaneously.
Native Backup for GKE covers namespace-scoped workloads backed by Persistent Disk, with the cross-project restore capability now generally available. For teams with more complex requirements, including operator-managed applications, Helm releases, multi-namespace apps, or workloads requiring cross-platform portability, a purpose-built platform like Trilio fills the gaps that native tooling leaves open. Application-consistent hooks, automatic operator and Helm discovery, QCOW2-based incremental efficiency, and declarative restore transforms together address the failure modes that a manifest-plus-snapshot approach doesn’t account for.