Vector databases are everywhere now. If you are building anything with AI—recommendation engines, semantic search, RAG pipelines—you are probably running a vector database. And if you are running one in production, chances are it is Milvus on Kubernetes.
Here is the problem. Your vector database holds millions of embeddings. Maybe hundreds of millions. Each one represents expensive processing—API calls to OpenAI, inference from your own models, hours of batch jobs. If you lose that data, you are not just restoring from backup. You are reprocessing everything from scratch. Days of work. Maybe weeks.
I have seen this happen. A misconfigured retention policy deletes an entire collection. A failed upgrade corrupts the index. A storage failure takes down half your cluster. And if you do not have a real backup strategy—not just volume snapshots, but application-aware protection that understands Milvus architecture—you are in trouble.
In this post, I am going to show you how to protect Milvus on Kubernetes with Trilio for Kubernetes. We will walk through the architecture, configure backups step by step, and test a full disaster recovery scenario. By the end, you will have production-ready protection for your vector database infrastructure.
Why Vector Databases Need Special Protection
Let me start with why this matters. Vector databases are not like traditional databases. They have unique characteristics that make backup more complex.
Massive Data Volumes
Embeddings are big. A single 1536-dimensional embedding (OpenAI’s ada-002 model) is about 6KB. Multiply that by 10 million documents and you are at 60GB before you even build the index. The index structures (HNSW, IVF) can be 2-3x the size of the raw embeddings.
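To make the sizing concrete, here is the arithmetic as a quick sketch (assuming float32 vectors, which is the common case):

```python
# Back-of-envelope sizing for embedding storage (assumption: float32 vectors).
DIM = 1536                 # OpenAI ada-002 dimensionality
BYTES_PER_FLOAT = 4        # float32
NUM_DOCS = 10_000_000

vector_bytes = DIM * BYTES_PER_FLOAT          # ~6 KB per embedding
raw_gb = vector_bytes * NUM_DOCS / 1e9        # raw embeddings, in GB

# Index structures (HNSW, IVF) typically add 2-3x on top of the raw vectors.
index_low_gb = raw_gb * 2
index_high_gb = raw_gb * 3

print(f"Per-vector size: {vector_bytes} bytes")
print(f"Raw embeddings: {raw_gb:.1f} GB")
print(f"With index overhead: {raw_gb + index_low_gb:.0f}-{raw_gb + index_high_gb:.0f} GB total")
```

That is the data you would have to regenerate if you lost the cluster—before counting the API or GPU cost of producing the embeddings themselves.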
Complex State
Vector databases maintain sophisticated index structures that must be captured consistently. If you take a volume snapshot while Milvus is writing to etcd and flushing data to MinIO, you can end up with mismatched state. Corrupted indexes. Degraded search performance. Partial data.
Distributed Architecture
Milvus runs as a distributed system—coordinators, query nodes, data nodes, etcd, MinIO, message queues. You cannot just snapshot one volume and call it a day. You need coordinated, application-consistent backups across all components.
AI Pipeline Dependencies
Your vector database sits at the center of AI workflows. If it goes down, everything downstream stops. Embedding pipelines. Model inference. Application endpoints. The blast radius is huge.
Real-World Scenarios Where You Need Backup
Let me give you some examples of what can go wrong:
Accidental deletion: Someone runs a script that deletes a collection. Happens more often than you think.
Data corruption: A failed upgrade corrupts the etcd state or index files. Now your cluster won’t start.
Ransomware: Attackers target AI infrastructure. Your embeddings are encrypted.
Compliance: HIPAA, SOC 2, SEC regulations require point-in-time recovery and long-term retention.
Disaster recovery: A cloud region goes down. You need to restore in a different region.
Testing: You want to clone production data to staging without impacting performance.
Every one of these scenarios requires more than volume snapshots. You need application-aware backup that understands Milvus.
Understanding Milvus Architecture on Kubernetes
Before we configure backups, let me explain how Milvus works on Kubernetes. If you understand the architecture, you will understand what needs protection.
Core Components
Milvus uses a disaggregated architecture: storage and compute are completely separated, which gives you horizontal scalability and makes the worker nodes stateless.
1. Coordinator Layer
The coordinator layer is the control plane. It splits cluster management across three responsibilities, each served by a dedicated coordinator:
Root Coordinator: Schema management, DDL operations (create collection, drop collection)
Query Coordinator: Query node management, load balancing across query nodes
Data Coordinator: Data ingestion, flushing to storage, binlog management
The coordinator state lives in etcd. This is critical. If you lose etcd, you lose your schema, collection definitions, partition information—everything.
2. Worker Nodes (Stateless)
Worker nodes are stateless because storage is separated. This is important for backup. You do not need to back up the worker nodes themselves. You need to back up what they depend on.
Query Nodes: Handle search requests, maintain in-memory indexes
Data Nodes: Process data ingestion, flush to object storage
Index Nodes: Build vector indexes asynchronously
3. Storage Layer
Persistent state lives in two places:
etcd: Metadata, schemas, cluster coordination state
Object Storage (MinIO or S3): Actual vector data, index files, binlogs, write-ahead logs (WAL)
This is what you need to protect. etcd and object storage.
4. Message Queue
Milvus uses a streaming layer (Kafka, Pulsar, or RocksMQ) for reliable data ingestion and change data capture. In most deployments, the message queue is ephemeral—you do not need to back it up. Data is committed to object storage.
Kubernetes Deployment Patterns
There are two ways to deploy Milvus on Kubernetes.
Helm Charts (Quick Start)
Helm is the fastest way to get started:
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo update
helm install my-milvus milvus/milvus --namespace milvus --create-namespace
This works for testing and small deployments. For production, I recommend the operator.
Milvus Operator (Production)
The Milvus Operator turns Milvus into a first-class Kubernetes resource. You define a single custom resource (CR) and the operator manages the entire lifecycle.
Here is what a production Milvus CR looks like:
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: production-milvus
  namespace: vector-db
spec:
  mode: cluster
  dependencies:
    storage:
      type: MinIO
    etcd:
      persistence:
        enabled: true
        storageClass: fast-ssd
  components:
    queryNode:
      replicas: 3
      resources:
        limits:
          cpu: "4"
          memory: 16Gi
The operator handles StatefulSets, Services, ConfigMaps, and all the Kubernetes resources. This is the architecture we will protect with Trilio.
What Needs Backup Protection?
To back up Milvus completely, we need to capture:
etcd state: Metadata, schemas, collection definitions, partition information
Object storage: Vector data, index files, binlogs, WALs (everything in MinIO or S3)
Persistent Volume Claims (PVCs): If you are using persistent storage for etcd or local caching
Kubernetes resources: Milvus CR, ConfigMaps, Secrets, Services
Application context: Namespace-level dependencies
This is where Trilio’s application-aware backup becomes invaluable. Let me show you how it works.
Trilio for Kubernetes: Cloud-Native Data Protection
Trilio for Kubernetes is purpose-built for Kubernetes. It understands cloud-native patterns and provides application-consistent backup for stateful workloads.
Here is what makes Trilio different from volume-only backup tools:
1. Application-Aware Backups
Trilio captures the entire application context, not just volumes:
All Kubernetes resources (Deployments, StatefulSets, Services, CRDs)
Persistent volumes and data
Namespace-level dependencies
Resource relationships and ordering
For Milvus, this means Trilio backs up the Milvus CR, the operator configuration, etcd PVCs, MinIO PVCs, ConfigMaps, Secrets—everything. When you restore, the entire application comes back together.
2. CSI Snapshot Integration
Trilio leverages the Container Storage Interface (CSI) for efficient, storage-native snapshots:
Crash-consistent snapshots with minimal performance impact
Fast snapshot creation regardless of data volume size
Storage-efficient snapshots using native storage capabilities
Note on Incremental Backups: Trilio supports incremental backups architecturally. However, Kubernetes CSI does not currently support incremental snapshots at the storage layer. When CSI adds this capability, Trilio will automatically support incremental backups with significant storage and bandwidth savings.
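CSI snapshots require a VolumeSnapshotClass to exist in the cluster. A minimal sketch—the driver name here is an assumption (AWS EBS CSI); substitute whatever CSI driver backs your storage class:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: trilio-snapshot-class
driver: ebs.csi.aws.com   # assumption: AWS EBS CSI driver; use your cluster's driver
deletionPolicy: Retain    # keep snapshot content even if the VolumeSnapshot is deleted
```

Verify it with `kubectl get volumesnapshotclass` before configuring backups.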
3. Hooks for Application Consistency
For databases that need quiescence, Trilio supports pre-backup and post-backup hooks. You can run commands inside pods before the snapshot.
Here is an example for Milvus (we will configure this in detail later):
apiVersion: triliovault.trilio.io/v1
kind: Hook
metadata:
  name: milvus-flush-hook
  namespace: vector-db
spec:
  pre:
    execAction:
      command:
        - /bin/bash
        - -c
        - "curl -X POST http://localhost:9091/api/v1/flush && sleep 5"
    ignoreFailure: false
    maxRetryCount: 2
    timeoutSeconds: 30
This hook flushes in-memory data to MinIO before the snapshot. Application-consistent backup.
4. AI Workload Understanding
Trilio is designed for AI/ML workloads. It captures:
Model artifacts and embeddings
Training data and datasets
Pipeline configurations
Dependency chains across microservices
Vector databases fit right into this model.
5. Flexible Storage Targets
You can back up to multiple storage backends:
S3-compatible object storage (AWS S3, MinIO, Google Cloud Storage, Azure Blob)
NFS shares for on-premises deployments
Now let me show you how to set this up.
Step-by-Step: Protecting Milvus with Trilio
I am going to walk you through a complete implementation—from installing Trilio to running a full disaster recovery test.
Prerequisites
Before we start, make sure you have:
Kubernetes cluster (1.21+) with CSI-capable storage
kubectl and helm installed
Milvus cluster deployed (via Helm or Operator)
Storage backend for Trilio backups (S3 bucket or NFS)
Step 1: Install Trilio for Kubernetes
First, we install the Trilio operator. It manages the backup infrastructure.
# Add Trilio Helm repository
helm repo add trilio https://charts.k8strilio.com
helm repo update
# Install Trilio (control plane)
kubectl create namespace trilio-system
helm install trilio trilio/k8s-triliovault-operator \
--namespace trilio-system \
--set installTVK.applicationScope=Cluster
Verify the installation:
$ kubectl get pods -n trilio-system
NAME READY STATUS RESTARTS AGE
k8s-triliovault-control-plane-xxx 1/1 Running 0 2m
k8s-triliovault-admission-webhook-xxx 1/1 Running 0 2m
k8s-triliovault-datamover-xxx 1/1 Running 0 2m
You should see the control plane, admission webhook, and data mover pods running.
Step 2: Configure a Backup Target
Next, we configure where backups will be stored. I am using an S3 bucket in this example.
First, create the S3 credentials secret:
kubectl create secret generic s3-credentials \
--namespace trilio-system \
--from-literal=accessKey=YOUR_ACCESS_KEY \
--from-literal=secretKey=YOUR_SECRET_KEY
Now create a Target resource:
apiVersion: triliovault.trilio.io/v1
kind: Target
metadata:
  name: milvus-backup-target
  namespace: trilio-system
spec:
  type: ObjectStore
  vendor: AWS
  objectStoreCredentials:
    bucketName: milvus-backups-prod
    region: us-east-1
    credentialSecret:
      name: s3-credentials
      namespace: trilio-system
  thresholdCapacity: 1000Gi
Apply the Target:
kubectl apply -f backup-target.yaml
Verify it:
$ kubectl get target -n trilio-system
NAME TYPE VENDOR THRESHOLD STATUS
milvus-backup-target ObjectStore AWS 1000Gi Available
Step 3: Create Backup Policies
Trilio uses separate Policy resources for scheduling and retention. Let me create both.
Schedule Policy (When to Run Backups)
apiVersion: triliovault.trilio.io/v1
kind: Policy
metadata:
  name: milvus-daily-schedule
  namespace: trilio-system
spec:
  type: Schedule
  scheduleConfig:
    schedule:
      - "0 2 * * *"  # Daily at 2 AM
Retention Policy (How Long to Keep Backups)
apiVersion: triliovault.trilio.io/v1
kind: Policy
metadata:
  name: milvus-retention-policy
  namespace: trilio-system
spec:
  type: Retention
  retentionConfig:
    latest: 7          # Keep 7 most recent backups
    weekly: 4          # Keep 4 weekly backups
    dayOfWeek: Sunday
    monthly: 12        # Keep 12 monthly backups
    dateOfMonth: 1
    yearly: 2          # Keep 2 yearly backups
    monthOfYear: January
Apply both policies:
kubectl apply -f schedule-policy.yaml
kubectl apply -f retention-policy.yaml
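To sanity-check what this retention policy keeps, you can count the restore points it implies. A rough sketch—Trilio's actual pruning may let daily, weekly, and monthly copies overlap, so treat this as an upper bound:

```python
# Rough count of restore points implied by the retention policy above.
retention = {"latest": 7, "weekly": 4, "monthly": 12, "yearly": 2}

total_restore_points = sum(retention.values())
print(f"Up to {total_restore_points} restore points retained")
# The yearly: 2 setting means your oldest guaranteed point is roughly two years back.
```

That count drives your storage budget, which we will return to in the cost section.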
Step 4: Create Pre-Backup Hook
Before we create the backup plan, let’s define a hook to flush Milvus in-memory data to storage before snapshots.
apiVersion: triliovault.trilio.io/v1
kind: Hook
metadata:
  name: milvus-flush-hook
  namespace: vector-db
spec:
  pre:
    execAction:
      command:
        - /bin/bash
        - -c
        - |
          # Flush in-memory data to object storage
          curl -X POST "http://localhost:9091/api/v1/flush" && sleep 5
    ignoreFailure: false
    maxRetryCount: 2
    timeoutSeconds: 30
Apply the hook:
kubectl apply -f milvus-flush-hook.yaml
Step 5: Create a Backup Plan
Now we define a BackupPlan for the Milvus namespace. This tells Trilio what to back up and how.
apiVersion: triliovault.trilio.io/v1
kind: BackupPlan
metadata:
  name: milvus-production-backup
  namespace: vector-db
spec:
  backupConfig:
    target:
      name: milvus-backup-target
      namespace: trilio-system
    retentionPolicy:
      name: milvus-retention-policy
      namespace: trilio-system
    schedulePolicy:
      fullBackupPolicy:
        name: milvus-daily-schedule
        namespace: trilio-system
  backupPlanComponents:
    customSelector:
      selectResources:
        labelSelector:
          - matchLabels:
              app.kubernetes.io/instance: production-milvus
  hookConfig:
    mode: Sequential
    hooks:
      - hook:
          name: milvus-flush-hook
        podSelector:
          labels:
            - matchLabels:
                app: milvus
                component: datanode
This plan:
Backs up all resources with the app.kubernetes.io/instance: production-milvus label
Uses the target, schedule, and retention policies we created
Runs the flush hook on Milvus data nodes before backup
Apply it:
kubectl apply -f backup-plan.yaml
Verify:
$ kubectl get backupplan -n vector-db
NAME TARGET STATUS
milvus-production-backup milvus-backup-target Available
Step 6: Execute and Monitor Backups
Let me show you how to trigger an on-demand backup and monitor it.
Trigger an On-Demand Backup
kubectl create -f - <<EOF
apiVersion: triliovault.trilio.io/v1
kind: Backup
metadata:
  name: milvus-manual-backup-20260209-1530
  namespace: vector-db
spec:
  type: Full
  backupPlan:
    name: milvus-production-backup
    namespace: vector-db
EOF
Monitor Backup Progress
Check the status:
$ kubectl get backup -n vector-db
NAME TYPE STATUS START TIME SIZE
milvus-manual-backup-20260209-1530 Full InProgress 2026-02-09T15:30:00Z -
Get detailed information:
$ kubectl describe backup milvus-manual-backup-20260209-1530 -n vector-db
Status:
Status: InProgress
Start Timestamp: 2026-02-09T15:30:00Z
Components:
Milvus CR: Completed
etcd PVC: InProgress (45% complete)
MinIO PVC: Pending
View logs:
kubectl logs -n trilio-system -l app.kubernetes.io/name=k8s-triliovault --tail=50
When the backup completes:
$ kubectl get backup -n vector-db
NAME TYPE STATUS START TIME COMPLETION TIME SIZE
milvus-manual-backup-20260209-1530 Full Available 2026-02-09T15:30:00Z 2026-02-09T15:48:00Z 142Gi
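A quick way to sanity-check backup performance is to derive effective throughput from the timestamps above (a sketch, assuming the reported size is the amount actually transferred):

```python
from datetime import datetime

# Timestamps and size from the completed backup above.
start = datetime.fromisoformat("2026-02-09T15:30:00")
end = datetime.fromisoformat("2026-02-09T15:48:00")
size_gib = 142

duration_s = (end - start).total_seconds()          # 1080 seconds
throughput_mib_s = size_gib * 1024 / duration_s     # ~135 MiB/s

print(f"Duration: {duration_s / 60:.0f} min, throughput: {throughput_mib_s:.0f} MiB/s")
```

Track this number over time; a steady decline usually means data growth is outpacing your backup window.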
Perfect. We have a full backup. Now let me show you how to restore it.
Step 7: Perform a Restore
In a disaster scenario, we restore the Milvus cluster to a new namespace. I am going to simulate this by restoring to vector-db-dr.
Create a Restore resource:
apiVersion: triliovault.trilio.io/v1
kind: Restore
metadata:
  name: milvus-disaster-recovery
  namespace: vector-db-dr
spec:
  source:
    type: Backup
    backup:
      name: milvus-manual-backup-20260209-1530
      namespace: vector-db
Execute the restore:
# Create the target namespace
kubectl create namespace vector-db-dr
# Apply the restore
kubectl apply -f milvus-restore.yaml
# Monitor restore progress
$ kubectl get restore -n vector-db-dr -w
NAME STATUS START TIME COMPLETION TIME
milvus-disaster-recovery InProgress 2026-02-09T16:00:00Z -
Trilio does the following:
Creates the namespace prerequisites
Restores the Milvus CR and ConfigMaps
Restores PVCs from CSI snapshots (etcd and MinIO volumes)
Restores Secrets and Services
Starts the Milvus pods
Validates pod health and readiness
After a few minutes:
$ kubectl get restore -n vector-db-dr
NAME STATUS START TIME COMPLETION TIME
milvus-disaster-recovery Completed 2026-02-09T16:00:00Z 2026-02-09T16:12:00Z
Let me verify the Milvus cluster is running:
$ kubectl get pods -n vector-db-dr
NAME READY STATUS RESTARTS AGE
production-milvus-rootcoord-0 1/1 Running 0 8m
production-milvus-querycoord-0 1/1 Running 0 8m
production-milvus-datacoord-0 1/1 Running 0 8m
production-milvus-querynode-0 1/1 Running 0 7m
production-milvus-querynode-1 1/1 Running 0 7m
production-milvus-querynode-2 1/1 Running 0 7m
production-milvus-datanode-0 1/1 Running 0 7m
production-milvus-etcd-0 1/1 Running 0 8m
production-milvus-minio-0 1/1 Running 0 8m
Perfect. The entire cluster is restored and running. All the data, indexes, collections—everything is back.
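To verify that claim rather than trust it, compare collection statistics between what you expect and what the restored cluster reports. In practice you would pull the row counts via the Milvus SDK (e.g. pymilvus collection statistics); the comparison logic itself is simple, and the collection names below are hypothetical:

```python
def validate_restore(expected_counts: dict, restored_counts: dict) -> list:
    """Return human-readable mismatches between expected and restored
    collection row counts; an empty list means the restore checks out."""
    problems = []
    for name, expected in expected_counts.items():
        actual = restored_counts.get(name)
        if actual is None:
            problems.append(f"collection {name!r} missing after restore")
        elif actual != expected:
            problems.append(f"collection {name!r}: expected {expected} rows, got {actual}")
    return problems

# Hypothetical example: counts recorded before backup vs. counts after restore.
expected = {"products": 10_000_000, "reviews": 2_500_000}
restored = {"products": 10_000_000, "reviews": 2_500_000}
assert validate_restore(expected, restored) == []
```

Also run a handful of known similarity queries and compare the result sets—row counts alone will not catch a corrupted index.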
Architecture: How Trilio Integrates with Milvus
Let me explain how the backup and restore flow works under the hood.
Backup Flow
Scheduled Trigger: Trilio scheduler initiates backup based on the cron policy
Pre-Backup Hook: Trilio executes the hook to flush Milvus in-memory data to storage
Resource Discovery: Trilio identifies all Kubernetes resources matching the backup plan selector
Metadata Capture: Trilio stores the Kubernetes manifests (YAML definitions) in the backup catalog
CSI Snapshots: Trilio creates storage snapshots of etcd and MinIO PVCs via the CSI driver
Data Transfer: Trilio uploads snapshots and metadata to the S3 backup target
Post-Backup Hook: Trilio executes post-backup hooks if defined
Catalog Update: Backup metadata is indexed in Trilio’s catalog for fast restore
The entire process is coordinated. etcd and MinIO are snapshotted at the same point in time, ensuring consistency.
Restore Flow
Restore Request: You specify the backup and target namespace
Namespace Preparation: Trilio creates the namespace and prerequisites
Resource Recreation: Trilio applies Kubernetes manifests (Milvus CR, Services, ConfigMaps, Secrets)
Storage Restore: Trilio restores PVCs from CSI snapshots
Data Hydration: MinIO and etcd volumes are populated from the backup
Pod Startup: Milvus pods start and discover the restored data
Validation: Trilio monitors pod health and readiness
Complete: The restored Milvus cluster is ready for queries
The key insight here is that Trilio understands the dependencies. It restores resources in the correct order. etcd and MinIO volumes first, then the Milvus CR, then the pods.
Best Practices and Production Considerations
Let me share some best practices I have learned from running this in production.
1. Test Your Restores Regularly
The only valid backup is one you have successfully restored. I cannot stress this enough. Implement quarterly disaster recovery drills.
Here is a simple script to automate restore testing:
#!/bin/bash
# Automate restore testing
# Create test namespace
kubectl create namespace milvus-dr-test
# Restore to test namespace
kubectl apply -f milvus-restore-test.yaml
# Wait for restore to complete
kubectl wait --for=condition=Complete restore/milvus-dr-test \
-n milvus-dr-test --timeout=30m
# Run validation queries against restored cluster
./validate-milvus.sh milvus-dr-test
# Cleanup test environment
kubectl delete namespace milvus-dr-test
Run this quarterly. Document the results. Your future self will thank you.
2. Tune Backup Windows
For large Milvus deployments (>1TB):
Schedule full backups during low-traffic windows (e.g., Sunday 2 AM)
Use CSI snapshot clones for near-instant local recovery
Monitor backup duration and adjust schedule if backups start overlapping
3. Secure Backup Data
Protect your backups like you protect your production data:
Encrypt at rest: Enable S3 encryption, integrate with AWS KMS
IAM roles: Use principle of least privilege for backup access
Versioning: Enable versioning on S3 buckets for immutability
Compliance: Align retention with regulatory requirements (HIPAA, SOC 2, GDPR)
If an attacker compromises your cluster, your backups should still be safe.
4. Monitor Backup Health
Integrate backup monitoring with your observability stack. Trilio exposes Prometheus metrics:
# Port-forward to Trilio control plane
kubectl port-forward -n trilio-system \
svc/k8s-triliovault-control-plane 8080:8080
# Scrape metrics from /metrics endpoint
curl http://localhost:8080/metrics
Key metrics to alert on:
Backup failure rate: Alert if >5% of backups fail
Backup duration trends: Alert if backup duration increases >50%
Storage consumption: Alert if approaching threshold capacity
RPO violations: Alert if time since last successful backup exceeds RPO
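The RPO check in particular is easy to script: compare the timestamp of the last successful backup against your RPO target. A sketch—in production you would read the timestamp from Trilio's Backup resources or its Prometheus metrics rather than hard-code it:

```python
from datetime import datetime, timedelta

def rpo_violated(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the time since the last successful backup exceeds the RPO."""
    return (now - last_backup) > rpo

last = datetime(2026, 2, 9, 2, 0)    # last successful backup at 2 AM
now = datetime(2026, 2, 10, 3, 30)   # 25.5 hours later
print(rpo_violated(last, now, timedelta(hours=24)))  # 24h RPO missed -> True
```

Wire the boolean into whatever alerting path you already use; the point is that an RPO breach fires an alert, not a dashboard tile.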
Set up PagerDuty or Opsgenie alerts for backup failures. Treat them as critical incidents.
5. Multi-Region Disaster Recovery
For mission-critical AI services, implement geographic redundancy. Configure multiple backup targets:
# Primary backup to AWS S3 us-east-1
---
apiVersion: triliovault.trilio.io/v1
kind: Target
metadata:
  name: milvus-primary-target
  namespace: trilio-system
spec:
  type: ObjectStore
  vendor: AWS
  objectStoreCredentials:
    bucketName: milvus-backups-us-east-1
    region: us-east-1
    credentialSecret:
      name: s3-us-east-1-credentials
  thresholdCapacity: 2000Gi
---
# Secondary backup to AWS S3 us-west-2
apiVersion: triliovault.trilio.io/v1
kind: Target
metadata:
  name: milvus-secondary-target
  namespace: trilio-system
spec:
  type: ObjectStore
  vendor: AWS
  objectStoreCredentials:
    bucketName: milvus-backups-us-west-2
    region: us-west-2
    credentialSecret:
      name: s3-us-west-2-credentials
  thresholdCapacity: 2000Gi
Create separate backup plans for each target. If AWS us-east-1 goes down, you can restore from us-west-2.
Cost Optimization
Vector databases generate significant backup volumes. Let me show you how to optimize costs.
1. Leverage CSI Snapshots
CSI snapshots are storage-efficient. They use copy-on-write at the block level, so only changed blocks consume additional storage.
2. S3 Lifecycle Policies
Move old backups to cold storage. Use S3 lifecycle policies to transition backups to Glacier after 30 days:
{
  "Rules": [
    {
      "Id": "MilvusBackupLifecycle",
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
This keeps recent backups in Standard storage for fast restore, moves older backups to Glacier for long-term retention, and deletes backups after 2 years.
3. Right-Size Retention
Balance compliance needs against storage costs. Run a cost analysis:
For a 1TB Milvus cluster with daily backups:
Daily backups (7 days): ~1.4TB (initial + daily changes)
Weekly backups (4 weeks): ~4TB
Monthly backups (12 months): ~12TB
Total: ~17.4TB
At $0.023/GB/month (S3 Standard), that is $400/month. With lifecycle policies moving old backups to Glacier ($0.004/GB/month), you can reduce this to ~$150/month.
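The arithmetic behind those numbers, as a sketch. The prices are approximate published S3 rates at the time of writing, and the tier split is an assumption; the tiered figure lands in the same ballpark as ~$150/month depending on exactly how much data sits in each tier:

```python
# Approximate monthly cost for ~17.4 TB of backups (prices in $/GB/month).
STANDARD, STANDARD_IA, GLACIER = 0.023, 0.0125, 0.004

tb = {"daily": 1.4, "weekly": 4.0, "monthly": 12.0}   # from the estimate above
total_gb = sum(tb.values()) * 1000

all_standard = total_gb * STANDARD                     # everything in S3 Standard
tiered = (tb["daily"] * 1000 * STANDARD                # recent backups stay hot
          + tb["weekly"] * 1000 * STANDARD_IA          # assumption: weeklies in IA
          + tb["monthly"] * 1000 * GLACIER)            # older backups in Glacier

print(f"All Standard: ${all_standard:.0f}/month, tiered: ${tiered:.0f}/month")
```

Rerun the numbers whenever your retention policy or cluster size changes; storage cost scales linearly with both.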
Real-World Use Cases
Let me share three real-world examples of Milvus backup and recovery with Trilio.
Use Case 1: E-Commerce Recommendation Engine
Scenario: A retail platform uses Milvus to power product recommendations based on visual similarity. They index 100 million products with 512-dimensional image embeddings. The system must be available 24/7.
Solution:
Daily full backups at 2 AM (RPO: 24 hours)
Retention: 7 daily, 4 weekly, 12 monthly backups
Cross-region replication to AWS us-west-2 for disaster recovery
Automated restore testing in staging environment monthly
Outcome: They experienced a storage array failure during Black Friday. The etcd volume became corrupted. They restored the Milvus cluster from the backup taken 18 hours earlier. Total recovery time: 8 minutes. The recommendation engine was back online before customers noticed.
Use Case 2: Healthcare Semantic Search
Scenario: A medical research institution indexes 50 million clinical documents for semantic search. Researchers query the system to find similar patient cases and clinical trial data. HIPAA compliance requires 7-year retention and audit trails.
Solution:
Daily backups with encryption at rest (AWS KMS)
Immutable S3 backups with Object Lock (legal hold)
Automated compliance reporting for audit
Quarterly restore validation with documented procedures
Outcome: They passed a HIPAA audit with comprehensive backup documentation. The auditors requested proof of disaster recovery capability. They demonstrated a full restore in the test environment, recovering 50 million documents and all indexes in under 25 minutes.
Use Case 3: Financial RAG Pipeline
Scenario: A fintech company runs RAG (Retrieval-Augmented Generation) pipelines over 200 million financial documents. Their chatbot answers customer queries about investment products. SEC regulations require point-in-time recovery for regulatory investigations.
Solution:
Daily full backups with 5-year retention
Pre-backup hooks ensure transaction consistency
Multi-region backup storage (AWS us-east-1 and us-west-2)
Immutable backups for compliance
Outcome: They received a regulatory investigation request requiring them to restore the system to a specific date 18 months ago. They restored the Milvus cluster from the backup taken on that exact date, demonstrating compliance with SEC recovery requirements. The investigation was resolved without penalty.
Summary
Let me recap what we covered.
Vector databases like Milvus are critical infrastructure for AI applications. They hold millions of embeddings representing expensive processing. If you lose that data, you are reprocessing from scratch—days or weeks of work.
Protecting Milvus on Kubernetes requires more than volume snapshots. You need application-aware backup that understands the distributed architecture: coordinators, query nodes, data nodes, etcd, MinIO, and the Kubernetes resources that tie it all together.
Trilio for Kubernetes provides exactly this. Application-aware backups with CSI snapshot integration, hooks for consistency, and flexible storage targets. We walked through a complete implementation—from installing Trilio to running a full disaster recovery test.
The key takeaways:
Understand the architecture: Milvus uses disaggregated storage-compute. Protect etcd (metadata) and object storage (data).
Use application-aware backup: Volume snapshots are not enough. Back up the entire application context.
Test your restores: The only valid backup is one you have successfully restored. Run quarterly DR drills.
Optimize costs: Use CSI snapshots, S3 lifecycle policies, and right-size retention.
Monitor backup health: Integrate with your observability stack. Alert on failures, duration trends, and RPO violations.
If you are running Milvus in production, implement this today. Do not wait for disaster to strike.
Thank you for reading! I hope this was useful.