
How to Back Up Milvus Vector Databases on Kubernetes with Trilio


Vector databases are everywhere now. If you are building anything with AI—recommendation engines, semantic search, RAG pipelines—you are probably running a vector database. And if you are running it in production, there is a good chance it is Milvus on Kubernetes.

Here is the problem. Your vector database holds millions of embeddings. Maybe hundreds of millions. Each one represents expensive processing—API calls to OpenAI, inference from your own models, hours of batch jobs. If you lose that data, you are not just restoring from backup. You are reprocessing everything from scratch. Days of work. Maybe weeks.

I have seen this happen. A misconfigured retention policy deletes an entire collection. A failed upgrade corrupts the index. A storage failure takes down half your cluster. And if you do not have a real backup strategy—not just volume snapshots, but application-aware protection that understands Milvus architecture—you are in trouble.

In this post, I am going to show you how to protect Milvus on Kubernetes with Trilio for Kubernetes. We will walk through the architecture, configure backups step by step, and test a full disaster recovery scenario. By the end, you will have production-ready protection for your vector database infrastructure.

Why Vector Databases Need Special Protection

Let me start with why this matters. Vector databases are not like traditional databases. They have unique characteristics that make backup more complex.

Massive Data Volumes

Embeddings are big. A single 1536-dimensional float32 embedding (OpenAI's text-embedding-ada-002 model) is about 6KB. Multiply that by 10 million documents and you are at roughly 60GB before you even build the index. The index structures (HNSW, IVF) can be 2-3x the size of the raw embeddings.
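As a sanity check, here is that arithmetic as a short Python sketch. The 2.5x index factor is an assumption picked from the middle of the 2-3x range above:

```python
# Back-of-envelope storage math for a vector corpus.
dims = 1536               # text-embedding-ada-002 output dimensions
bytes_per_float = 4       # float32
vec_bytes = dims * bytes_per_float
print(vec_bytes)          # 6144 bytes, i.e. about 6 KB per embedding

docs = 10_000_000
raw_gb = docs * vec_bytes / 1e9
index_factor = 2.5        # assumption: middle of the 2-3x HNSW/IVF range
index_gb = raw_gb * index_factor
print(round(raw_gb, 1), round(index_gb, 1))   # ~61.4 GB raw, ~153.6 GB of index
```

Scale the document count and you can see why backup volumes for vector databases get large fast.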

Complex State

Vector databases maintain sophisticated index structures that must be captured consistently. If you take a volume snapshot while Milvus is writing to etcd and flushing data to MinIO, you can end up with mismatched state. Corrupted indexes. Degraded search performance. Partial data.

Distributed Architecture

Milvus runs as a distributed system—coordinators, query nodes, data nodes, etcd, MinIO, message queues. You cannot just snapshot one volume and call it a day. You need coordinated, application-consistent backups across all components.

AI Pipeline Dependencies

Your vector database sits at the center of AI workflows. If it goes down, everything downstream stops. Embedding pipelines. Model inference. Application endpoints. The blast radius is huge.

Real-World Scenarios Where You Need Backup

Let me give you some examples of what can go wrong:

  • Accidental deletion: Someone runs a script that deletes a collection. Happens more often than you think.

  • Data corruption: A failed upgrade corrupts the etcd state or index files. Now your cluster won’t start.

  • Ransomware: Attackers target AI infrastructure. Your embeddings are encrypted.

  • Compliance: HIPAA, SOC 2, SEC regulations require point-in-time recovery and long-term retention.

  • Disaster recovery: A cloud region goes down. You need to restore in a different region.

  • Testing: You want to clone production data to staging without impacting performance.

Every one of these scenarios requires more than volume snapshots. You need application-aware backup that understands Milvus.

Understanding Milvus Architecture on Kubernetes

Before we configure backups, let me explain how Milvus works on Kubernetes. If you understand the architecture, you will understand what needs protection.

Core Components

Milvus uses a disaggregated architecture: storage and compute are completely separated, which gives you horizontal scalability and keeps the worker nodes stateless.

1. Coordinator Layer

The coordinator layer is the control plane. Depending on your Milvus version it runs as a single merged coordinator or as separate coordinator pods, but either way it covers three responsibilities:

  • Root Coordinator: Schema management, DDL operations (create collection, drop collection)

  • Query Coordinator: Query node management, load balancing across query nodes

  • Data Coordinator: Data ingestion, flushing to storage, binlog management

The coordinator state lives in etcd. This is critical. If you lose etcd, you lose your schema, collection definitions, partition information—everything.

2. Worker Nodes (Stateless)

Worker nodes are stateless because storage is separated. This is important for backup. You do not need to back up the worker nodes themselves. You need to back up what they depend on.

  • Query Nodes: Handle search requests, maintain in-memory indexes

  • Data Nodes: Process data ingestion, flush to object storage

  • Index Nodes: Build vector indexes asynchronously

3. Storage Layer

Persistent state lives in two places:

  • etcd: Metadata, schemas, cluster coordination state

  • Object Storage (MinIO or S3): Actual vector data, index files, binlogs, write-ahead logs (WAL)

This is what you need to protect. etcd and object storage.

4. Message Queue

Milvus uses a streaming layer (Kafka, Pulsar, or RocksMQ) for reliable data ingestion and change data capture. In most deployments the message queue is ephemeral—you do not need to back it up, because once data is flushed it is committed to object storage. This is also why we will flush Milvus before each backup.

Kubernetes Deployment Patterns

There are two ways to deploy Milvus on Kubernetes.

Helm Charts (Quick Start)

Helm is the fastest way to get started:

    
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo update
helm install my-milvus milvus/milvus --namespace milvus --create-namespace
    
   

This works for testing and small deployments. For production, I recommend the operator.

Milvus Operator (Production)

The Milvus Operator turns Milvus into a first-class Kubernetes resource. You define a single custom resource (CR) and the operator manages the entire lifecycle.

Here is what a production Milvus CR looks like:

    
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: production-milvus
  namespace: vector-db
spec:
  mode: cluster
  dependencies:
    storage:
      type: MinIO
    etcd:
      persistence:
        enabled: true
        storageClass: fast-ssd
  components:
    queryNode:
      replicas: 3
      resources:
        limits:
          cpu: "4"
          memory: 16Gi
    
   

The operator handles StatefulSets, Services, ConfigMaps, and all the Kubernetes resources. This is the architecture we will protect with Trilio.

What Needs Backup Protection?

To back up Milvus completely, we need to capture:

  1. etcd state: Metadata, schemas, collection definitions, partition information

  2. Object storage: Vector data, index files, binlogs, WALs (everything in MinIO or S3)

  3. Persistent Volume Claims (PVCs): If you are using persistent storage for etcd or local caching

  4. Kubernetes resources: Milvus CR, ConfigMaps, Secrets, Services

  5. Application context: Namespace-level dependencies

This is where Trilio’s application-aware backup becomes invaluable. Let me show you how it works.

Trilio for Kubernetes: Cloud-Native Data Protection

Trilio for Kubernetes is purpose-built for Kubernetes. It understands cloud-native patterns and provides application-consistent backup for stateful workloads.

Here is what makes Trilio different from volume-only backup tools:

1. Application-Aware Backups

Trilio captures the entire application context, not just volumes:

  • All Kubernetes resources (Deployments, StatefulSets, Services, CRDs)

  • Persistent volumes and data

  • Namespace-level dependencies

  • Resource relationships and ordering

For Milvus, this means Trilio backs up the Milvus CR, the operator configuration, etcd PVCs, MinIO PVCs, ConfigMaps, Secrets—everything. When you restore, the entire application comes back together.

2. CSI Snapshot Integration

Trilio leverages the Container Storage Interface (CSI) for efficient, storage-native snapshots:

  • Crash-consistent snapshots with minimal performance impact

  • Fast snapshot creation regardless of data volume size

  • Storage-efficient snapshots using native storage capabilities

Note on Incremental Backups: Trilio supports incremental backups architecturally. However, Kubernetes CSI does not currently support incremental snapshots at the storage layer. When CSI adds this capability, Trilio will automatically support incremental backups with significant storage and bandwidth savings.

3. Hooks for Application Consistency

For databases that need quiescence, Trilio supports pre-backup and post-backup hooks. You can run commands inside pods before the snapshot.

Here is an example for Milvus (we will configure this in detail later):

    
apiVersion: triliovault.trilio.io/v1
kind: Hook
metadata:
  name: milvus-flush-hook
  namespace: vector-db
spec:
  pre:
    execAction:
      command:
        - /bin/bash
        - -c
        - "curl -X POST http://localhost:9091/api/v1/flush && sleep 5"
    ignoreFailure: false
    maxRetryCount: 2
    timeoutSeconds: 30
    
   

This hook flushes in-memory data to MinIO before the snapshot. Application-consistent backup.

4. AI Workload Understanding

Trilio is designed for AI/ML workloads. It captures:

  • Model artifacts and embeddings

  • Training data and datasets

  • Pipeline configurations

  • Dependency chains across microservices

Vector databases fit right into this model.

5. Flexible Storage Targets

You can back up to multiple storage backends:

  • S3-compatible object storage (AWS S3, MinIO, Google Cloud Storage, Azure Blob)

  • NFS shares for on-premises deployments

Now let me show you how to set this up.

Step-by-Step: Protecting Milvus with Trilio

I am going to walk you through a complete implementation—from installing Trilio to running a full disaster recovery test.

Prerequisites

Before we start, make sure you have:

  • Kubernetes cluster (1.21+) with CSI-capable storage

  • kubectl and helm installed

  • Milvus cluster deployed (via Helm or Operator)

  • Storage backend for Trilio backups (S3 bucket or NFS)

Step 1: Install Trilio for Kubernetes

First, we install the Trilio operator. It manages the backup infrastructure.

    
# Add Trilio Helm repository
helm repo add trilio https://charts.k8strilio.com
helm repo update

# Install Trilio (control plane)
kubectl create namespace trilio-system

helm install trilio trilio/k8s-triliovault-operator \
  --namespace trilio-system \
  --set installTVK.applicationScope=Cluster
    
   

Verify the installation:

    
$ kubectl get pods -n trilio-system

NAME                                  READY   STATUS    RESTARTS   AGE
k8s-triliovault-control-plane-xxx     1/1     Running   0          2m
k8s-triliovault-admission-webhook-xxx 1/1     Running   0          2m
k8s-triliovault-datamover-xxx         1/1     Running   0          2m
    
   

You should see the control plane, admission webhook, and data mover pods running.

Step 2: Configure a Backup Target

Next, we configure where backups will be stored. I am using an S3 bucket in this example.

First, create the S3 credentials secret:

    
kubectl create secret generic s3-credentials \
  --namespace trilio-system \
  --from-literal=accessKey=YOUR_ACCESS_KEY \
  --from-literal=secretKey=YOUR_SECRET_KEY
    
   

Now create a Target resource:

    
apiVersion: triliovault.trilio.io/v1
kind: Target
metadata:
  name: milvus-backup-target
  namespace: trilio-system
spec:
  type: ObjectStore
  vendor: AWS
  objectStoreCredentials:
    bucketName: milvus-backups-prod
    region: us-east-1
    credentialSecret:
      name: s3-credentials
      namespace: trilio-system
  thresholdCapacity: 1000Gi
    
   

Apply the Target:

    
kubectl apply -f backup-target.yaml
    
   

Verify it:

    
$ kubectl get target -n trilio-system

NAME                    TYPE          VENDOR   THRESHOLD   STATUS
milvus-backup-target    ObjectStore   AWS      1000Gi      Available
    
   

Step 3: Create Backup Policies

Trilio uses separate Policy resources for scheduling and retention. Let me create both.

Schedule Policy (When to Run Backups)
    
apiVersion: triliovault.trilio.io/v1
kind: Policy
metadata:
  name: milvus-daily-schedule
  namespace: trilio-system
spec:
  type: Schedule
  scheduleConfig:
    schedule:
      - "0 2 * * *"  # Daily at 2 AM
    
   
Retention Policy (How Long to Keep Backups)
    
apiVersion: triliovault.trilio.io/v1
kind: Policy
metadata:
  name: milvus-retention-policy
  namespace: trilio-system
spec:
  type: Retention
  retentionConfig:
    latest: 7        # Keep 7 most recent backups
    weekly: 4        # Keep 4 weekly backups
    dayOfWeek: Sunday
    monthly: 12      # Keep 12 monthly backups
    dateOfMonth: 1
    yearly: 2        # Keep 2 yearly backups
    monthOfYear: January
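
One way to sanity-check a grandfather-father-son policy like this is to compute the worst-case number of retained backups and the storage ceiling it implies. A quick sketch (the 150 GiB full-backup size is an illustrative assumption):

```python
# Worst-case backup count implied by the GFS-style retention policy above.
retention = {"latest": 7, "weekly": 4, "monthly": 12, "yearly": 2}
max_retained = sum(retention.values())
print(max_retained)       # 25 backups at steady state

full_backup_gib = 150     # assumption: illustrative size of one full backup
ceiling_gib = max_retained * full_backup_gib
print(ceiling_gib)        # 3750 GiB upper bound on target usage
```

Compare that ceiling against your Target's thresholdCapacity before the schedule goes live.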
    
   

Apply both policies:

    
kubectl apply -f schedule-policy.yaml
kubectl apply -f retention-policy.yaml
    
   

Step 4: Create Pre-Backup Hook

Before we create the backup plan, let’s define a hook to flush Milvus in-memory data to storage before snapshots.

    
apiVersion: triliovault.trilio.io/v1
kind: Hook
metadata:
  name: milvus-flush-hook
  namespace: vector-db
spec:
  pre:
    execAction:
      command:
        - /bin/bash
        - -c
        - |
          # Flush in-memory data to object storage
          curl -X POST "http://localhost:9091/api/v1/flush" && sleep 5
    ignoreFailure: false
    maxRetryCount: 2
    timeoutSeconds: 30
    
   

Apply the hook:

    
kubectl apply -f milvus-flush-hook.yaml
    
   

Step 5: Create a Backup Plan

Now we define a BackupPlan for the Milvus namespace. This tells Trilio what to back up and how.

    
apiVersion: triliovault.trilio.io/v1
kind: BackupPlan
metadata:
  name: milvus-production-backup
  namespace: vector-db
spec:
  backupConfig:
    target:
      name: milvus-backup-target
      namespace: trilio-system
    retentionPolicy:
      name: milvus-retention-policy
      namespace: trilio-system
    schedulePolicy:
      fullBackupPolicy:
        name: milvus-daily-schedule
        namespace: trilio-system
  backupPlanComponents:
    customSelector:
      selectResources:
        labelSelector:
          - matchLabels:
              app.kubernetes.io/instance: production-milvus
  hookConfig:
    mode: Sequential
    hooks:
      - hook:
          name: milvus-flush-hook
        podSelector:
          labels:
            - matchLabels:
                app: milvus
                component: datanode
    
   

This plan:

  • Backs up all resources with the app.kubernetes.io/instance: production-milvus label

  • Uses the target, schedule, and retention policies we created

  • Runs the flush hook on Milvus data nodes before backup

Apply it:

    
kubectl apply -f backup-plan.yaml
    
   

Verify:

    
$ kubectl get backupplan -n vector-db

NAME                        TARGET                  STATUS
milvus-production-backup    milvus-backup-target    Available
    
   

Step 6: Execute and Monitor Backups

Let me show you how to trigger an on-demand backup and monitor it.

Trigger an On-Demand Backup
    
kubectl create -f - <<EOF
apiVersion: triliovault.trilio.io/v1
kind: Backup
metadata:
  name: milvus-manual-backup-$(date +%Y%m%d-%H%M)
  namespace: vector-db
spec:
  type: Full
  backupPlan:
    name: milvus-production-backup
    namespace: vector-db
EOF
    
   
Monitor Backup Progress

Check the status:

    
$ kubectl get backup -n vector-db

NAME                               TYPE   STATUS       START TIME             SIZE
milvus-manual-backup-20260209-1530 Full   InProgress   2026-02-09T15:30:00Z   -
    
   

Get detailed information:

    
$ kubectl describe backup milvus-manual-backup-20260209-1530 -n vector-db

Status:
  Status: InProgress
  Start Timestamp: 2026-02-09T15:30:00Z
  Components:
    Milvus CR: Completed
    etcd PVC: InProgress (45% complete)
    MinIO PVC: Pending
    
   

View logs:

    
kubectl logs -n trilio-system -l app.kubernetes.io/name=k8s-triliovault --tail=50
    
   

When the backup completes:

    
$ kubectl get backup -n vector-db

NAME                               TYPE   STATUS      START TIME             COMPLETION TIME          SIZE
milvus-manual-backup-20260209-1530 Full   Available   2026-02-09T15:30:00Z   2026-02-09T15:48:00Z     142Gi
    
   

Perfect. We have a full backup. Now let me show you how to restore it.

Step 7: Perform a Restore

In a disaster scenario, we restore the Milvus cluster to a new namespace. I am going to simulate this by restoring to vector-db-dr.

Create a Restore resource:

    
apiVersion: triliovault.trilio.io/v1
kind: Restore
metadata:
  name: milvus-disaster-recovery
  namespace: vector-db-dr
spec:
  source:
    type: Backup
    backup:
      name: milvus-manual-backup-20260209-1530
      namespace: vector-db
    
   

Execute the restore:

    
# Create the target namespace
kubectl create namespace vector-db-dr

# Apply the restore
kubectl apply -f milvus-restore.yaml

# Monitor restore progress
$ kubectl get restore -n vector-db-dr -w

NAME                       STATUS      START TIME             COMPLETION TIME
milvus-disaster-recovery   InProgress  2026-02-09T16:00:00Z   -
    
   

Trilio does the following:

  1. Creates the namespace prerequisites

  2. Restores the Milvus CR and ConfigMaps

  3. Restores PVCs from CSI snapshots (etcd and MinIO volumes)

  4. Restores Secrets and Services

  5. Starts the Milvus pods

  6. Validates pod health and readiness

After a few minutes:

    
$ kubectl get restore -n vector-db-dr

NAME                       STATUS      START TIME             COMPLETION TIME
milvus-disaster-recovery   Completed   2026-02-09T16:00:00Z   2026-02-09T16:12:00Z
    
   

Let me verify the Milvus cluster is running:

    
$ kubectl get pods -n vector-db-dr

NAME                                  READY   STATUS    RESTARTS   AGE
production-milvus-rootcoord-0         1/1     Running   0          8m
production-milvus-querycoord-0        1/1     Running   0          8m
production-milvus-datacoord-0         1/1     Running   0          8m
production-milvus-querynode-0         1/1     Running   0          7m
production-milvus-querynode-1         1/1     Running   0          7m
production-milvus-querynode-2         1/1     Running   0          7m
production-milvus-datanode-0          1/1     Running   0          7m
production-milvus-etcd-0              1/1     Running   0          8m
production-milvus-minio-0             1/1     Running   0          8m
    
   

Perfect. The entire cluster is restored and running. All the data, indexes, collections—everything is back.

Architecture: How Trilio Integrates with Milvus

Let me explain how the backup and restore flow works under the hood.

Backup Flow

  1. Scheduled Trigger: Trilio scheduler initiates backup based on the cron policy

  2. Pre-Backup Hook: Trilio executes the hook to flush Milvus in-memory data to storage

  3. Resource Discovery: Trilio identifies all Kubernetes resources matching the backup plan selector

  4. Metadata Capture: Trilio stores the Kubernetes manifests (YAML definitions) in the backup catalog

  5. CSI Snapshots: Trilio creates storage snapshots of etcd and MinIO PVCs via the CSI driver

  6. Data Transfer: Trilio uploads snapshots and metadata to the S3 backup target

  7. Post-Backup Hook: Trilio executes post-backup hooks if defined

  8. Catalog Update: Backup metadata is indexed in Trilio’s catalog for fast restore

The entire process is coordinated. etcd and MinIO are snapshotted at the same point in time, ensuring consistency.

Restore Flow

  1. Restore Request: You specify the backup and target namespace

  2. Namespace Preparation: Trilio creates the namespace and prerequisites

  3. Resource Recreation: Trilio applies Kubernetes manifests (Milvus CR, Services, ConfigMaps, Secrets)

  4. Storage Restore: Trilio restores PVCs from CSI snapshots

  5. Data Hydration: MinIO and etcd volumes are populated from the backup

  6. Pod Startup: Milvus pods start and discover the restored data

  7. Validation: Trilio monitors pod health and readiness

  8. Complete: The restored Milvus cluster is ready for queries

The key insight here is that Trilio understands the dependencies. It restores resources in the correct order. etcd and MinIO volumes first, then the Milvus CR, then the pods.
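The ordering logic can be illustrated with a topological sort over a dependency graph. This sketch uses Python's standard library; the dependency map is a simplified illustration, not Trilio's actual internal model:

```python
from graphlib import TopologicalSorter

# Simplified restore dependency graph: each key lists what must be
# restored before it. Illustrative only -- not Trilio's internal data.
deps = {
    "pvcs": [],
    "secrets": [],
    "configmaps": [],
    "services": ["configmaps"],
    "milvus-cr": ["pvcs", "secrets", "configmaps"],
    "pods": ["milvus-cr", "services"],
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # storage and config land before the CR, the CR before pods
```

Any restore tool that respects a graph like this will bring up etcd and MinIO volumes before the Milvus CR, and the CR before the pods that consume it.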

Best Practices and Production Considerations

Let me share some best practices I have learned from running this in production.

1. Test Your Restores Regularly

The only valid backup is one you have successfully restored. I cannot stress this enough. Implement quarterly disaster recovery drills.

Here is a simple script to automate restore testing:

    
#!/bin/bash
# Automate restore testing

# Create test namespace
kubectl create namespace milvus-dr-test

# Restore to test namespace
kubectl apply -f milvus-restore-test.yaml

# Wait for restore to complete
kubectl wait --for=condition=Complete restore/milvus-dr-test \
  -n milvus-dr-test --timeout=30m

# Run validation queries against restored cluster
./validate-milvus.sh milvus-dr-test

# Cleanup test environment
kubectl delete namespace milvus-dr-test
    
   

Run this quarterly. Document the results. Your future self will thank you.

2. Tune Backup Windows

For large Milvus deployments (>1TB):

  • Schedule full backups during low-traffic windows (e.g., Sunday 2 AM)

  • Use CSI snapshot clones for near-instant local recovery

  • Monitor backup duration and adjust schedule if backups start overlapping

3. Secure Backup Data

Protect your backups like you protect your production data:

  • Encrypt at rest: Enable S3 encryption, integrate with AWS KMS

  • IAM roles: Use principle of least privilege for backup access

  • Versioning: Enable versioning on S3 buckets for immutability

  • Compliance: Align retention with regulatory requirements (HIPAA, SOC 2, GDPR)

If an attacker compromises your cluster, your backups should still be safe.

4. Monitor Backup Health

Integrate backup monitoring with your observability stack. Trilio exposes Prometheus metrics:

    
# Port-forward to Trilio control plane
kubectl port-forward -n trilio-system \
  svc/k8s-triliovault-control-plane 8080:8080

# Scrape metrics from /metrics endpoint
curl http://localhost:8080/metrics
    
   

Key metrics to alert on:

  • Backup failure rate: Alert if >5% of backups fail

  • Backup duration trends: Alert if backup duration increases >50%

  • Storage consumption: Alert if approaching threshold capacity

  • RPO violations: Alert if time since last successful backup exceeds RPO

Set up PagerDuty or Opsgenie alerts for backup failures. Treat them as critical incidents.
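The RPO check in particular is easy to state precisely. A minimal sketch of the alert condition (the function name and example timestamps are illustrative):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def rpo_violated(last_success: datetime, rpo: timedelta,
                 now: Optional[datetime] = None) -> bool:
    """True if time since the last successful backup exceeds the RPO."""
    now = now or datetime.now(timezone.utc)
    return now - last_success > rpo

# Daily backups imply a 24-hour RPO.
now = datetime(2026, 2, 10, 12, 0, tzinfo=timezone.utc)
last = datetime(2026, 2, 9, 2, 0, tzinfo=timezone.utc)   # last 2 AM run
print(rpo_violated(last, timedelta(hours=24), now))      # True: 34h elapsed
```

Wire the same predicate into your alerting rules so a silently failing schedule pages someone before the RPO is blown by days instead of hours.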

5. Multi-Region Disaster Recovery

For mission-critical AI services, implement geographic redundancy. Configure multiple backup targets:

    
# Primary backup to AWS S3 us-east-1
---
apiVersion: triliovault.trilio.io/v1
kind: Target
metadata:
  name: milvus-primary-target
  namespace: trilio-system
spec:
  type: ObjectStore
  vendor: AWS
  objectStoreCredentials:
    bucketName: milvus-backups-us-east-1
    region: us-east-1
    credentialSecret:
      name: s3-us-east-1-credentials
  thresholdCapacity: 2000Gi

---
# Secondary backup to AWS S3 us-west-2
apiVersion: triliovault.trilio.io/v1
kind: Target
metadata:
  name: milvus-secondary-target
  namespace: trilio-system
spec:
  type: ObjectStore
  vendor: AWS
  objectStoreCredentials:
    bucketName: milvus-backups-us-west-2
    region: us-west-2
    credentialSecret:
      name: s3-us-west-2-credentials
  thresholdCapacity: 2000Gi
    
   

Create separate backup plans for each target. If AWS us-east-1 goes down, you can restore from us-west-2.

Cost Optimization

Vector databases generate significant backup volumes. Let me show you how to optimize costs.

1. Leverage CSI Snapshots

CSI snapshots are storage-efficient. They use copy-on-write at the block level, so only changed blocks consume additional storage.
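To see why copy-on-write matters, compare a week of COW snapshots with naive full copies. The 6% daily churn rate here is an assumed figure for illustration:

```python
# Copy-on-write vs. full-copy storage for a week of snapshots.
full_gb = 1000            # volume size
daily_change_gb = 60      # assumption: ~6% of blocks change per day
snapshots = 7

cow_gb = full_gb + daily_change_gb * (snapshots - 1)   # base + deltas only
naive_gb = full_gb * snapshots                         # full copy each day
print(cow_gb, naive_gb)   # 1360 vs 7000 -- roughly a 5x saving
```

The higher your daily churn, the smaller the saving, so measure your actual change rate before sizing snapshot storage.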

2. S3 Lifecycle Policies

Move old backups to cold storage. Use S3 lifecycle policies to transition backups to Standard-IA after 30 days and to Glacier after 90 days:

    
{
  "Rules": [
    {
      "Id": "MilvusBackupLifecycle",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 730
      }
    }
  ]
}
    
   

This keeps recent backups in Standard storage for fast restore, moves older backups to Glacier for long-term retention, and deletes backups after 2 years.
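Those rules reduce to a small age-to-tier function, which is handy for unit-testing retention assumptions (the function name is illustrative):

```python
def storage_class(age_days: int) -> str:
    """Map a backup's age to the S3 tier the lifecycle policy puts it in."""
    if age_days >= 730:
        return "EXPIRED"      # removed by the Expiration rule
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"

print(storage_class(10), storage_class(45), storage_class(120))
```

Keep this in sync with the JSON policy so your cost model and your actual bucket behavior never drift apart.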

3. Right-Size Retention

Balance compliance needs against storage costs. Run a cost analysis:

For a 1TB Milvus cluster with daily backups:

  • Daily backups (7 days): ~1.4TB (initial + daily changes)

  • Weekly backups (4 weeks): ~4TB

  • Monthly backups (12 months): ~12TB

  • Total: ~17.4TB

At $0.023/GB/month (S3 Standard), that is $400/month. With lifecycle policies moving old backups to Glacier ($0.004/GB/month), you can reduce this to ~$150/month.
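Here is the arithmetic behind those figures. The tier sizes and prices are the estimates above; the optimized split (dailies and weeklies in Standard, monthlies in Glacier) is my assumption, and it lands in the same ballpark as the ~$150 figure:

```python
# Monthly S3 cost for the 1 TB cluster's backup set.
tiers_gb = {"daily": 1400, "weekly": 4000, "monthly": 12000}
total_gb = sum(tiers_gb.values())
print(total_gb / 1000)                 # 17.4 TB

STANDARD, GLACIER = 0.023, 0.004       # $/GB/month
all_standard = total_gb * STANDARD
print(round(all_standard))             # ~$400/month, everything in Standard

# Assumption: recent tiers stay in Standard, monthlies move to Glacier.
optimized = (tiers_gb["daily"] + tiers_gb["weekly"]) * STANDARD \
            + tiers_gb["monthly"] * GLACIER
print(round(optimized))                # ~$172/month in this simple model
```

Rerun this with your own backup sizes and retention counts before committing to a lifecycle policy.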

Real-World Use Cases

Let me share three real-world examples of Milvus backup and recovery with Trilio.

Use Case 1: E-Commerce Recommendation Engine

Scenario: A retail platform uses Milvus to power product recommendations based on visual similarity. They index 100 million products with 512-dimensional image embeddings. The system must be available 24/7.

Solution:

  • Daily full backups at 2 AM (RPO: 24 hours)

  • Retention: 7 daily, 4 weekly, 12 monthly backups

  • Cross-region replication to AWS us-west-2 for disaster recovery

  • Automated restore testing in staging environment monthly

Outcome: They experienced a storage array failure during Black Friday. The etcd volume became corrupted. They restored the Milvus cluster from the backup taken 18 hours earlier. Total recovery time: 8 minutes. The recommendation engine was back online before customers noticed.

Use Case 2: Healthcare Semantic Search

Scenario: A medical research institution indexes 50 million clinical documents for semantic search. Researchers query the system to find similar patient cases and clinical trial data. HIPAA compliance requires 7-year retention and audit trails.

Solution:

  • Daily backups with encryption at rest (AWS KMS)

  • Immutable S3 backups with Object Lock (legal hold)

  • Automated compliance reporting for audit

  • Quarterly restore validation with documented procedures

Outcome: They passed a HIPAA audit with comprehensive backup documentation. The auditors requested proof of disaster recovery capability. They demonstrated a full restore in the test environment, recovering 50 million documents and all indexes in under 25 minutes.

Use Case 3: Financial RAG Pipeline

Scenario: A fintech company runs RAG (Retrieval-Augmented Generation) pipelines over 200 million financial documents. Their chatbot answers customer queries about investment products. SEC regulations require point-in-time recovery for regulatory investigations.

Solution:

  • Daily full backups with 5-year retention

  • Pre-backup hooks ensure transaction consistency

  • Multi-region backup storage (AWS us-east-1 and us-west-2)

  • Immutable backups for compliance

Outcome: They received a regulatory investigation request requiring them to restore the system to a specific date 18 months ago. They restored the Milvus cluster from the backup taken on that exact date, demonstrating compliance with SEC recovery requirements. The investigation was resolved without penalty.

Summary

Let me recap what we covered.

Vector databases like Milvus are critical infrastructure for AI applications. They hold millions of embeddings representing expensive processing. If you lose that data, you are reprocessing from scratch—days or weeks of work.

Protecting Milvus on Kubernetes requires more than volume snapshots. You need application-aware backup that understands the distributed architecture: coordinators, query nodes, data nodes, etcd, MinIO, and the Kubernetes resources that tie it all together.

Trilio for Kubernetes provides exactly this. Application-aware backups with CSI snapshot integration, hooks for consistency, and flexible storage targets. We walked through a complete implementation—from installing Trilio to running a full disaster recovery test.

The key takeaways:

  1. Understand the architecture: Milvus uses disaggregated storage-compute. Protect etcd (metadata) and object storage (data).

  2. Use application-aware backup: Volume snapshots are not enough. Back up the entire application context.

  3. Test your restores: The only valid backup is one you have successfully restored. Run quarterly DR drills.

  4. Optimize costs: Use CSI snapshots, S3 lifecycle policies, and right-size retention.

  5. Monitor backup health: Integrate with your observability stack. Alert on failures, duration trends, and RPO violations.

If you are running Milvus in production, implement this today. Do not wait for disaster to strike.

Thank you for reading! I hope this was useful.


Author


Rodolfo Casás

Rodolfo Casás is a Solution Architect from Madrid working for Trilio with a special focus on cloud-native computing, hybrid cloud strategies, telco and data protection.


Copyright © 2026 by Trilio
