Protecting Red Hat OpenShift AI with Trilio for Kubernetes: a hands-on lab

A few weeks ago I was on a call with a financial services customer who had moved a credit-decisioning model into production on Red Hat OpenShift AI. They were happy with the platform. They were less happy with the answer they had for a question their risk officer had just asked:

“If an attacker encrypts the cluster tomorrow, what do we need to bring back to be inference-ready by Monday morning?”

The team started listing the obvious things — the model artifact, the serving endpoint. Then it got harder. The model registry that proves which version is in production. The pipeline run history the auditor had asked to see. The Jupyter workbenches their data scientists were actively using. The S3 credentials that connect all of it together.

It became clear pretty quickly that “back up the cluster” doesn’t mean what it used to. RHOAI is more than a model server — it’s a full MLOps platform, and each component has its own state. If you treat it like a stateless web app, the things regulators care most about are the things you lose first.

In this post I’m going to walk through what end-to-end protection of an RHOAI 3.x DSProject actually looks like with Trilio for Kubernetes. I’ll build a real workload, take a backup, blow the workload away, and bring it back from the backup — same notes in the workbench, same registered model, same pipeline run history, same predictions coming out of both a classical scikit-learn model and a vLLM-served generative model.

Our goal here is to show what a “good” Red Hat OpenShift AI protection plan covers — and to surface the gotchas I hit so you don’t have to.

Prerequisites

If you want to follow along on your own cluster, you’ll need:

  • OpenShift 4.18 or later (I used 4.20)
  • Red Hat OpenShift AI 3.3 or later, with the Dashboard, KServe, Data Science Pipelines, Workbenches, and Model Registry components enabled in the DataScienceCluster
  • A CSI storage class that supports volume snapshots (I’m using OpenShift Data Foundation with ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs)
  • NooBaa (ships with ODF) — or any S3-compatible object store you can point KServe at
  • Trilio for Kubernetes 5.3+ installed in trilio-system, with a valid license and a healthy backup Target (mine is NFS-backed)
  • oc CLI logged in as a cluster admin

A real lab — not a minikube. If you don’t have ODF, any storage class with CSI snapshots will work, but the storage class names in the examples below will need adjustment.

What we're going to build

Two namespaces:

  • rhoai-demo — the application namespace. Hosts the workbench, the pipeline server, the KServe inference services, and the S3 data connection.
  • rhoai-model-registries — where the RHOAI Model Registry instance lives. The Model Registry component in RHOAI 3.x binds every ModelRegistry instance to one specific namespace via the DataScienceCluster config, so this is a separate scope.

Inside those two namespaces, we’ll have:

  • A Jupyter workbench (Notebook CR + StatefulSet + 20 GiB cephfs PVC)
  • A Data Science Pipelines server (DSPA + 7 pods + MariaDB)
  • A registered model (ModelRegistry instance + its MariaDB backend + REST endpoint)
  • Two InferenceService resources: one classical (sklearn iris on MLServer), one generative (facebook/opt-125m on vLLM)
  • An ObjectBucketClaim providing an S3 bucket on NooBaa, plus the data-connection Secret that ties KServe and DSPA to it

One ClusterBackupPlan in Trilio will cover both namespaces in a single consistent backup. That’s the plan.

Step 1 — Enabling OpenShift AI Model Registry and creating the project

By default, the Model Registry component in RHOAI is set to Removed in the DataScienceCluster. We need to turn it on:

    
oc patch datasciencecluster default-dsc --type=merge -p '{
  "spec":{"components":{"modelregistry":{
    "managementState":"Managed",
    "registriesNamespace":"rhoai-model-registries"
  }}}
}'

    
   

The Model Registry component is Removed by default in the DataScienceCluster for a couple of practical reasons. First, it only graduated from Tech Preview to GA in RHOAI 3.x, and Red Hat keeps recently-promoted components opt-in for a release or two so customers consciously choose them rather than waking up to new operators after a routine upgrade. Second, the operator alone isn’t enough: a working Model Registry needs an external MySQL or MariaDB database that the DSC doesn’t provision for you. Defaulting to Removed avoids leaving an idle operator running in clusters where nobody has provided the backing database. Turning it on is a one-line patch, fully reversible, and only enables the operator. You still create the ModelRegistry instance and its database yourself in Step 4.

This takes a couple of minutes to reconcile. When it’s done, the component reports Ready:

    
     $ oc get datasciencecluster default-dsc -o jsonpath='{.status.conditions[?(@.type=="ModelRegistryReady")].status}'
True

    
   

The operator also auto-creates the rhoai-model-registries namespace. Good. Now I create the application namespace and label it so it shows up in the RHOAI dashboard:

    
oc create namespace rhoai-demo
oc label namespace rhoai-demo opendatahub.io/dashboard=true modelmesh-enabled=false --overwrite
oc annotate namespace rhoai-demo \
  openshift.io/display-name="RHOAI Demo" \
  openshift.io/description="Trilio + RHOAI demo project" --overwrite

    
   

If I refresh the RHOAI dashboard, RHOAI Demo shows up under Data Science Projects. Empty, but visible.

Step 2 — S3, the data connection, and why I'm using the internal NooBaa endpoint

KServe and Data Science Pipelines both need an S3 endpoint to read models and write artifacts. I want my models stored somewhere durable and separate from any single pod’s PVC, so I’m using NooBaa via an ObjectBucketClaim:

    
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: rhoai-demo-bucket
  namespace: rhoai-demo
spec:
  generateBucketName: rhoai-demo
  storageClassName: openshift-storage.noobaa.io

    
   

After applying, the OBC moves to Bound and NooBaa generates two artifacts in my namespace: a Secret with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and a ConfigMap with the bucket name and endpoint:

    
     $ oc get obc -n rhoai-demo
NAME                STORAGE-CLASS                 PHASE   AGE
rhoai-demo-bucket   openshift-storage.noobaa.io   Bound   30s

    
   

Now I have to create an RHOAI-format data connection Secret that pulls those credentials together. Important detail: the AWS_S3_ENDPOINT here is the internal NooBaa service, not the external route.

    
apiVersion: v1
kind: Secret
metadata:
  name: aws-connection-rhoai-demo-bucket
  namespace: rhoai-demo
  labels:
    opendatahub.io/dashboard: "true"
    opendatahub.io/managed: "true"
  annotations:
    opendatahub.io/connection-type: s3
    openshift.io/display-name: rhoai-demo-bucket
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <key from the OBC's Secret>
  AWS_SECRET_ACCESS_KEY: <secret from the OBC's Secret>
  AWS_DEFAULT_REGION: us-east-1
  AWS_S3_BUCKET: <bucket name from the OBC's ConfigMap>
  AWS_S3_ENDPOINT: https://s3.openshift-storage.svc

    
   

Why the internal endpoint and not the external route? I tried the external route first (s3-openshift-storage.apps.<cluster>...) and the DSPA pod immediately blew up with:

    
     tls: failed to verify certificate: x509: certificate signed by unknown authority

    
   

The external NooBaa route is signed by the cluster’s default ingress cert, which is self-signed and not in any pod’s trust chain by default. The internal service endpoint uses the OpenShift Service CA, which every pod automatically trusts. Switch to the internal endpoint and the problem disappears.
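
To pull this step together, here is a minimal Python sketch that assembles the data connection Secret from the values NooBaa generated. It assumes the OBC's Secret and ConfigMap have already been read and decoded into plain dicts, and that the ConfigMap exposes the bucket under the BUCKET_NAME key (the usual ObjectBucketClaim convention, but check yours). The function name and the sample values are hypothetical.

```python
import json


def build_data_connection(obc_secret: dict, obc_configmap: dict,
                          display_name: str = "rhoai-demo-bucket",
                          namespace: str = "rhoai-demo",
                          endpoint: str = "https://s3.openshift-storage.svc") -> dict:
    """Return the data connection Secret as a dict that `oc apply -f -` accepts as JSON.

    Both inputs are expected to hold already-decoded string values.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"aws-connection-{display_name}",
            "namespace": namespace,
            "labels": {
                "opendatahub.io/dashboard": "true",
                "opendatahub.io/managed": "true",
            },
            "annotations": {
                "opendatahub.io/connection-type": "s3",
                "openshift.io/display-name": display_name,
            },
        },
        "type": "Opaque",
        "stringData": {
            "AWS_ACCESS_KEY_ID": obc_secret["AWS_ACCESS_KEY_ID"],
            "AWS_SECRET_ACCESS_KEY": obc_secret["AWS_SECRET_ACCESS_KEY"],
            "AWS_DEFAULT_REGION": "us-east-1",
            # ASSUMPTION: BUCKET_NAME is the key used in the OBC ConfigMap.
            "AWS_S3_BUCKET": obc_configmap["BUCKET_NAME"],
            # Internal Service endpoint (Service CA), not the external route.
            "AWS_S3_ENDPOINT": endpoint,
        },
    }


# Placeholder values for illustration only.
secret = build_data_connection(
    {"AWS_ACCESS_KEY_ID": "EXAMPLEKEY", "AWS_SECRET_ACCESS_KEY": "EXAMPLESECRET"},
    {"BUCKET_NAME": "rhoai-demo-0000"},
)
print(json.dumps(secret, indent=2))
```

Keeping the endpoint as a parameter makes it easy to swap in an external S3 later without touching the rest of the manifest.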

Step 3 — Deploying the pipeline server and running a real pipeline

Now I deploy a Data Science Pipelines Application (DSPA). This will give me the apiserver, the workflow controller, the persistence agent, the scheduled-workflow controller, the metadata services, and a MariaDB to hold pipeline definitions and run history.

    
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: dspa
  namespace: rhoai-demo
spec:
  apiServer:
    deploy: true
    enableSamplePipeline: false
  persistenceAgent:
    deploy: true
  scheduledWorkflow:
    deploy: true
  objectStorage:
    externalStorage:
      bucket: <bucket name>
      host: s3.openshift-storage.svc
      port: "443"
      scheme: https
      region: us-east-1
      secure: true
      s3CredentialsSecret:
        accessKey: AWS_ACCESS_KEY_ID
        secretKey: AWS_SECRET_ACCESS_KEY
        secretName: aws-connection-rhoai-demo-bucket
  database:
    disableHealthCheck: false
    mariaDB:
      deploy: true
      pipelineDBName: mlpipeline
      storageClassName: ocs-storagecluster-ceph-rbd

    
   

Couple of minutes later I have seven pods running:

    
     $ oc get pods -n rhoai-demo
NAME                                                    READY   STATUS    RESTARTS   AGE
ds-pipeline-dspa-697687986b-988tx                       2/2     Running   0          2m
ds-pipeline-metadata-envoy-dspa-c5c7549d6-9tmjl         2/2     Running   0          2m
ds-pipeline-metadata-grpc-dspa-b59b76b68-r86zw          1/1     Running   0          2m
ds-pipeline-persistenceagent-dspa-dc758b4f7-6xnmv       1/1     Running   0          2m
ds-pipeline-scheduledworkflow-dspa-66788677-kbdld       1/1     Running   0          2m
ds-pipeline-workflow-controller-dspa-694d4b8c9d-sc6tc   1/1     Running   0          2m
mariadb-dspa-54fb668576-bkmc9                           1/1     Running   0          3m

    
   

Now I want a real pipeline in there — not an empty server. I write a tiny KFP v2 pipeline that trains the iris dataset:

    
from kfp import dsl, compiler

@dsl.component(
    base_image="registry.access.redhat.com/ubi9/python-311:latest",
    packages_to_install=["scikit-learn==1.5.0", "joblib"]
)
def train_iris(accuracy_threshold: float = 0.9) -> str:
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    X, y = load_iris(return_X_y=True)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(Xtr, ytr)
    return f"trained-acc-{clf.score(Xte, yte):.4f}"

@dsl.component(base_image="registry.access.redhat.com/ubi9/python-311:latest")
def register(model_info: str) -> str:
    return f"registered:{model_info}"

@dsl.pipeline(name="iris-classifier-training")
def iris_pipeline(accuracy_threshold: float = 0.9):
    t = train_iris(accuracy_threshold=accuracy_threshold)
    register(model_info=t.output)

compiler.Compiler().compile(iris_pipeline, "/tmp/iris_pipeline.yaml")

    
   

I run that inside a Kubernetes Job, then POST the resulting YAML to the DSPA apiserver via its multipart upload endpoint. Two things tripped me up here:

  1. DSPA’s port 8888 serves HTTPS, not HTTP, even though the port name is http, because podToPodTLS is enabled by default. Easy fix: use https:// and curl -k.

  2. The DSPA NetworkPolicy only allows ingress on 8888 from pods labeled opendatahub.io/workbenches: "true", plus a few other DSPA component selectors. Without that label on my Job pod, the connection just hangs.

With those two fixes, the upload and run trigger work:

    
     $ oc get workflow -n rhoai-demo
NAME                             PHASE       PROGRESS
iris-classifier-training-cxxwr   Succeeded   5/5

    
   

The run shows up in the RHOAI Pipelines tab, complete with the parameters I passed and the step-by-step status.
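
For reference, the upload call from the Job can be sketched in plain Python with only the standard library. This is an illustration under assumptions, not the exact script I ran: the path /apis/v2beta1/pipelines/upload and the uploadfile form field are the KFP v2 REST conventions as I understand them, the helper names are hypothetical, and depending on your DSPA configuration you may also need a bearer token.

```python
import ssl
import urllib.parse
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def upload_pipeline(base_url: str, yaml_path: str, name: str, token: str = "") -> str:
    """POST a compiled pipeline YAML to the DSPA apiserver over HTTPS."""
    with open(yaml_path, "rb") as fh:
        body, content_type = build_multipart("uploadfile", "pipeline.yaml", fh.read())
    # ASSUMPTION: KFP v2 upload path and query parameter.
    url = f"{base_url}/apis/v2beta1/pipelines/upload?name={urllib.parse.quote(name)}"
    request = urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": content_type}
    )
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    # Equivalent of `curl -k`: skip certificate verification (lab use only).
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode()
```

From a pod carrying the workbench label, a call would look like upload_pipeline("https://ds-pipeline-dspa.rhoai-demo.svc:8888", "/tmp/iris_pipeline.yaml", "iris-classifier-training"), where the Service name is my guess based on the pod names above.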

Step 4 — Model Registry: MariaDB backend and registering a model via the API

The Model Registry component in RHOAI 3.x installs the operator and the Model Catalog. It does not create a ModelRegistry instance for you, and the registry server doesn’t ship with its own database — you bring your own MySQL/MariaDB.

I deploy a small MariaDB, then create the ModelRegistry CR that points at it:

    
apiVersion: modelregistry.opendatahub.io/v1beta1
kind: ModelRegistry
metadata:
  name: rhoai-registry
  namespace: rhoai-model-registries
spec:
  grpc: {}
  rest: {}
  oauthProxy: {}
  mysql:
    host: model-registry-db
    port: 3306
    database: model_registry
    username: mlmduser
    passwordSecret:
      key: database-password
      name: model-registry-db
    skipDBCreation: false

    
   

The operator reconciles this into a Deployment, Service, Route, and OAuth proxy in front of the REST and gRPC ports. I want to register the iris model I trained in step 3, so I need to talk to the registry’s API.

The external route is OAuth-protected and meant for human use. For machine-to-machine registration I can hit the registry pod’s :8080 directly — but only after relaxing the NetworkPolicy that the operator installs:

    
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-to-registry-8080
  namespace: rhoai-model-registries
spec:
  podSelector:
    matchLabels:
      app: rhoai-registry
      app.kubernetes.io/name: rhoai-registry
      component: model-registry
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector: {}
    ports:
    - {protocol: TCP, port: 8080}

    
   

Plus a tiny additional Service that targets port 8080 (the registry’s default Service only exposes 8443):

    
apiVersion: v1
kind: Service
metadata:
  name: rhoai-registry-internal
  namespace: rhoai-model-registries
spec:
  type: ClusterIP
  ports:
  - {port: 8080, targetPort: 8080, name: http-api}
  selector:
    app: rhoai-registry
    app.kubernetes.io/name: rhoai-registry
    component: model-registry

    
   

Now I can POST to the registry. The Kubeflow Model Registry REST API has a small quirk worth noting: child-resource IDs go in the body, not just the URL path:

    
# Create the registered model
curl -s -X POST http://rhoai-registry-internal.rhoai-model-registries.svc:8080/api/model_registry/v1alpha3/registered_models \
  -H "Content-Type: application/json" \
  -d '{"name":"iris-classifier","description":"RandomForest iris classifier for the Trilio demo","owner":"rodolfo","state":"LIVE"}'

# Create a version (registeredModelId in body)
curl -s -X POST http://.../api/model_registry/v1alpha3/model_versions \
  -H "Content-Type: application/json" \
  -d '{"registeredModelId":"1","name":"v1.0.0","author":"rodolfo","state":"LIVE"}'

# Create the artifact pointer (modelVersionId in body)
curl -s -X POST http://.../api/model_registry/v1alpha3/model_versions/2/artifacts \
  -H "Content-Type: application/json" \
  -d '{"modelVersionId":"2","name":"iris-model-artifact","modelFormatName":"sklearn","storageKey":"aws-connection-rhoai-demo-bucket","storagePath":"models/iris","uri":"s3://<bucket>/models/iris","artifactType":"model-artifact","state":"LIVE"}'

    
   

I open the registry in the RHOAI dashboard and there’s my iris-classifier v1.0.0, pointing at the S3 path where the actual model lives.
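
The three calls chain together: each response carries the ID the next request needs in its body. A small Python sketch of that chaining follows, using the same paths and payload fields as the curl commands. I am assuming each POST returns the created object with an "id" field; the function names are hypothetical, and the HTTP helper is only one way to do the transport.

```python
import json
import urllib.request
from typing import Callable

API = "/api/model_registry/v1alpha3"


def http_poster(base_url: str) -> Callable[[str, dict], dict]:
    """Return a post(path, body) function bound to one registry base URL."""
    def post(path: str, body: dict) -> dict:
        request = urllib.request.Request(
            base_url + path,
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return post


def register_iris(post: Callable[[str, dict], dict], bucket: str) -> dict:
    """Create model, version, and artifact, passing parent IDs in each body."""
    model = post(f"{API}/registered_models", {
        "name": "iris-classifier", "owner": "rodolfo", "state": "LIVE",
    })
    version = post(f"{API}/model_versions", {
        "registeredModelId": model["id"],  # parent ID goes in the body
        "name": "v1.0.0", "author": "rodolfo", "state": "LIVE",
    })
    artifact = post(f"{API}/model_versions/{version['id']}/artifacts", {
        "modelVersionId": version["id"],   # in the body as well as the path
        "name": "iris-model-artifact",
        "modelFormatName": "sklearn",
        "storageKey": "aws-connection-rhoai-demo-bucket",
        "storagePath": "models/iris",
        "uri": f"s3://{bucket}/models/iris",
        "artifactType": "model-artifact",
        "state": "LIVE",
    })
    return {"model": model["id"], "version": version["id"], "artifact": artifact["id"]}
```

Passing the post function in keeps the ID logic separate from the transport, so the same code can be exercised with a stub before pointing it at http_poster("http://rhoai-registry-internal.rhoai-model-registries.svc:8080").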

Step 5 — KServe: serving classical and generative in the same namespace

Two ServingRuntime resources, two InferenceService resources. Same namespace.

For the classical model I’m using MLServer:

    
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-runtime
  namespace: rhoai-demo
spec:
  containers:
  - name: kserve-container
    image: registry.redhat.io/rhoai/odh-mlserver-rhel9@sha256:bd44d...
    env:
    - {name: MLSERVER_MODEL_NAME, value: '{{.Name}}'}
    - {name: MLSERVER_HTTP_PORT, value: "8080"}
    - {name: MODELS_DIR, value: /mnt/models}
    ports: [{containerPort: 8080, protocol: TCP}]
  supportedModelFormats:
  - {name: sklearn, version: "1", autoSelect: true}

    
   

For the generative model I’m using vLLM, specifically the x86 CPU build that ships in the RHOAI catalog. And this is where I lost an hour the first time around. With the default vLLM environment, the CPU runtime tries to allocate roughly 31 GiB for its KV cache, which is fine if you have that kind of memory budget — and not fine if you don’t. My pods got OOMKilled within 30 seconds, every time.

The fix is one environment variable on the ServingRuntime:

    
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cpu-x86-runtime
  namespace: rhoai-demo
spec:
  containers:
  - name: kserve-container
    image: registry.redhat.io/rhaiis/vllm-cpu-rhel9@sha256:f05e7...
    command: [python, -m, vllm.entrypoints.openai.api_server]
    args:
    - --port=8080
    - --model=/mnt/models
    - --served-model-name={{.Name}}
    - --dtype=float32
    - --max-model-len=512
    env:
    - {name: HF_HOME, value: /tmp/hf_home}
    - {name: VLLM_CPU_KVCACHE_SPACE, value: "1"}    # critical
    resources:
      requests: {cpu: "1", memory: 5Gi}
      limits:   {cpu: "2", memory: 8Gi}

    
   

VLLM_CPU_KVCACHE_SPACE=1 caps the KV cache at 1 GiB, which is plenty for facebook/opt-125m and any other small model you’d put on CPU for a lab. Important detail — set this on the ServingRuntime, not directly on the rendered Deployment. KServe re-renders the Deployment from the runtime spec, so any direct patch gets clobbered on the next reconcile.
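
To see why 1 GiB is plenty, here is a back-of-the-envelope calculation. It uses the published architecture of facebook/opt-125m (12 layers, hidden size 768) and assumes the KV cache is held in float32 to match --dtype=float32. It ignores vLLM's block granularity and any other runtime overhead, so read it as an order-of-magnitude sketch rather than an exact capacity figure.

```python
# Published architecture values for facebook/opt-125m.
NUM_LAYERS = 12
HIDDEN_SIZE = 768

BYTES_PER_VALUE = 4            # ASSUMPTION: cache stored as float32 (--dtype=float32)
MAX_MODEL_LEN = 512            # --max-model-len=512
KV_CACHE_BYTES = 1 * 1024**3   # VLLM_CPU_KVCACHE_SPACE=1 (GiB)

# One key vector and one value vector per layer, each HIDDEN_SIZE wide.
bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE
tokens_in_cache = KV_CACHE_BYTES // bytes_per_token
full_length_sequences = tokens_in_cache // MAX_MODEL_LEN

print(bytes_per_token)         # 73728 bytes, i.e. 72 KiB per token
print(tokens_in_cache)         # 14563 tokens fit in 1 GiB
print(full_length_sequences)   # 28 sequences at the full 512-token length
```

Even at the maximum context length, that leaves room for a couple of dozen concurrent requests, which is far more than a lab needs.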

I preload facebook/opt-125m into the S3 bucket via a Job that snapshots from HuggingFace and uploads to NooBaa, then create the two InferenceService resources:

    
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: iris
  namespace: rhoai-demo
spec:
  predictor:
    minReplicas: 1
    model:
      modelFormat: {name: sklearn, version: "1"}
      runtime: mlserver-runtime
      storage:
        key: aws-connection-rhoai-demo-bucket
        path: models/iris
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: opt125m
  namespace: rhoai-demo
spec:
  predictor:
    minReplicas: 1
    model:
      modelFormat: {name: vLLM}
      runtime: vllm-cpu-x86-runtime
      storage:
        key: aws-connection-rhoai-demo-bucket
        path: models/opt-125m

    
   

A couple of minutes later, both are serving. Quick test against iris:

    
     $ curl -s -X POST http://iris-predictor.rhoai-demo.svc.cluster.local:8080/v2/models/iris/infer \
    -H "Content-Type: application/json" \
    -d '{"inputs":[{"name":"predict","shape":[2,4],"datatype":"FP32","data":[[5.1,3.5,1.4,0.2],[6.7,3.1,5.6,2.4]]}]}'
{"model_name":"iris","model_version":"v1.0.0","outputs":[{"name":"predict","shape":[2,1],"datatype":"INT64","data":[0,2]}]}

    
   

Class 0 (setosa) and class 2 (virginica). Correct.
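
If you would rather script this check than eyeball it, the request body and the class lookup are easy to build in Python. This is a minimal sketch: the helper names are hypothetical, the payload layout mirrors the curl call above, and the class order is the one scikit-learn's load_iris uses.

```python
IRIS_CLASSES = ["setosa", "versicolor", "virginica"]  # load_iris target order


def build_infer_request(rows: list[list[float]]) -> dict:
    """Build a V2 inference request body for a batch of feature rows."""
    return {
        "inputs": [{
            "name": "predict",
            "shape": [len(rows), len(rows[0])],
            "datatype": "FP32",
            "data": rows,
        }]
    }


def decode_classes(response: dict) -> list[str]:
    """Map the integer class IDs in the first output tensor to species names."""
    return [IRIS_CLASSES[int(i)] for i in response["outputs"][0]["data"]]


request = build_infer_request([[5.1, 3.5, 1.4, 0.2], [6.7, 3.1, 5.6, 2.4]])
# Response copied from the curl output above.
response = {"model_name": "iris", "model_version": "v1.0.0", "outputs": [
    {"name": "predict", "shape": [2, 1], "datatype": "INT64", "data": [0, 2]}]}

print(request["inputs"][0]["shape"])   # [2, 4]
print(decode_classes(response))        # ['setosa', 'virginica']
```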

And vLLM:

    
     $ curl -s -X POST http://opt125m-predictor.rhoai-demo.svc.cluster.local:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"opt125m","prompt":"Trilio protects RHOAI by","max_tokens":20}'
{"id":"cmpl-912b...","model":"opt125m","choices":[{"text":" providing the best data collection, quality, and scalability for RHOAI workloads..."}]}

    
   

A small model hallucinates a small answer, but the plumbing works.

Important note on the headless service: if you try curl iris-predictor.<ns>.svc:80/... and get connection refused — that’s expected. KServe RawDeployment mode uses a headless Service (clusterIP: None), so DNS resolves directly to pod IPs and port mapping (80 → 8080) doesn’t apply. Always hit :8080 directly.

Step 6 — Creating a Trilio backup

Now for the part this whole post is about. I create one ClusterBackupPlan that covers both namespaces in one consistent backup:

    
apiVersion: triliovault.trilio.io/v1
kind: ClusterBackupPlan
metadata:
  name: rhoai-demo-plan
spec:
  backupConfig:
    target:
      name: tvk-target
      namespace: trilio-system
  backupComponents:
  - namespace: rhoai-demo
  - namespace: rhoai-model-registries

    
   

Then trigger a Full backup:

    
oc apply -f - <<EOF
apiVersion: triliovault.trilio.io/v1
kind: ClusterBackup
metadata:
  name: rhoai-demo-backup-baseline
spec:
  clusterBackupPlan:
    name: rhoai-demo-plan
  type: Full
EOF

    
   

About six minutes later:

    
     $ oc get clusterbackup
NAME                         TYPE   STATUS      DATA SIZE    DURATION
rhoai-demo-backup-baseline   Full   Available   2750685184   6m22s

    
   

Because Trilio runs inside the cluster, everything above is driven by CRDs and oc apply. There is also a UI, plus an OpenShift console plugin if you would rather not leave the OCP console, or just want a quick check that everything looks good.

2.75 GB (about 2.56 GiB) of consistent, restorable state across the two namespaces. That includes the workbench’s PVC content, the MariaDB databases for both DSPA and the model registry, every CR (workbench Notebook, DSPA, InferenceService, ServingRuntime, ModelRegistry, OBC, Secrets, NetworkPolicies, etc.), and even the container image references for an air-gapped restore scenario.

Now I open the workbench in JupyterLab, add a few cells of notes (“this is the state I want to be able to come back to”), save the notebook, and trigger an Incremental:

    
apiVersion: triliovault.trilio.io/v1
kind: ClusterBackup
metadata:
  name: rhoai-demo-backup-with-notes
spec:
  clusterBackupPlan:
    name: rhoai-demo-plan
  type: Incremental

    
   
    
     $ oc get clusterbackup
NAME                              TYPE          STATUS      DATA SIZE
rhoai-demo-backup-baseline        Full          Available   2750685184
rhoai-demo-backup-with-notes      Incremental   Available   25219072

    
   

25 MB. That’s the delta — the changed blocks in the workbench PVC plus the few CR updates that happened since the Full. Two restore points, two business decisions: “back to before I added the notes” or “back to right now.”
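
The DATA SIZE column is reported in bytes, so it is worth converting before quoting numbers to anyone. A quick Python check of the two values above, in both decimal and binary units:

```python
full_bytes = 2750685184   # rhoai-demo-backup-baseline
incr_bytes = 25219072     # rhoai-demo-backup-with-notes

print(f"Full: {full_bytes / 1e9:.2f} GB ({full_bytes / 2**30:.2f} GiB)")
# Full: 2.75 GB (2.56 GiB)
print(f"Incremental: {incr_bytes / 1e6:.1f} MB ({incr_bytes / 2**20:.1f} MiB)")
# Incremental: 25.2 MB (24.1 MiB)
print(f"Incremental / Full: {100 * incr_bytes / full_bytes:.2f}%")
# Incremental / Full: 0.92%
```

In this lab the Incremental is under 1% of the Full, which is what you would expect after editing a single notebook.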

Step 7 — The disaster

I want to show this is a real recovery, not a no-op. So I’m going to delete everything that defines the workload — keeping the namespaces themselves and the OBC alive (more on why in a moment):

    
oc delete -n rhoai-demo notebook,datasciencepipelinesapplication,inferenceservice,servingruntime,workflow,statefulset --all
oc delete pvc -n rhoai-demo --all
oc delete -n rhoai-model-registries modelregistry rhoai-registry
oc delete -n rhoai-model-registries deployment/model-registry-db svc/model-registry-db pvc/model-registry-db

    
   

After that:

    
     $ oc get notebook,inferenceservice,servingruntime,dspa,pvc -n rhoai-demo
No resources found in rhoai-demo namespace.

$ oc get modelregistry.modelregistry.opendatahub.io -n rhoai-model-registries
No resources found in rhoai-model-registries namespace.

    
   

The workbench is gone. The Pipelines tab in the RHOAI dashboard says “No pipelines deployed.” The Models tab is empty. The Model Registry section says no registries available.

Why I didn’t oc delete namespace: the ObjectBucketClaim has a default reclaim policy of Delete. If I delete the namespace, the OBC cascades to NooBaa, NooBaa deletes the bucket, and the iris and opt-125m model files in S3 are gone with it. The restore would bring back the InferenceService CRs but they’d fail to load because the artifacts they reference would not exist. A surgical workload delete keeps the bucket alive and the model files intact. For real production DR you’d want external S3 that’s outside the namespace lifecycle entirely — that’s a separate conversation.

Step 8 — Restore in under five minutes

One ClusterRestore brings it all back:

    
apiVersion: triliovault.trilio.io/v1
kind: ClusterRestore
metadata:
  name: rhoai-restore-pit
spec:
  source:
    type: ClusterBackup
    clusterBackup:
      name: rhoai-demo-backup-with-notes
  components:
  - sourceNamespace: rhoai-demo
    restoreNamespace: rhoai-demo
    restoreConfig:
      restoreFlags:
        skipIfAlreadyExists: true
        useOCPNamespaceUIDRange: true
  - sourceNamespace: rhoai-model-registries
    restoreNamespace: rhoai-model-registries
    restoreConfig:
      restoreFlags:
        skipIfAlreadyExists: true
        useOCPNamespaceUIDRange: true
      excludeResourceSelector:
        gvkSelector:
        - groupVersionKind:
            kind: PersistentVolumeClaim
            version: v1
          objects: [model-catalog-postgres]
        - groupVersionKind:
            kind: Deployment
            group: apps
            version: v1
          objects: [model-catalog, model-catalog-postgres]

    
   

Two flags and one exclude rule worth explaining:

  • skipIfAlreadyExists: true — OpenShift and RHOAI auto-inject a bunch of resources into every namespace (default ServiceAccounts, RBAC, the trusted-CA-bundle ConfigMap). They survive the delete and will be present when restore runs. This flag tells Trilio to leave them alone instead of choking on the conflict.

  • useOCPNamespaceUIDRange: true — OpenShift’s restricted-v2 SCC assigns container UIDs from a per-namespace range. PVC content gets written with those UIDs. Without this flag, restored pods can hit permission errors trying to read their own PVCs.

  • excludeResourceSelector on the registries namespace — the platform-level Model Catalog (model-catalog Deployment + model-catalog-postgres Deployment and PVC) is managed by the DSC and lives in rhoai-model-registries for organizational reasons. It survives the workload delete because we don’t touch it, but the backup snapshot captured it. Without the exclude, Trilio’s PVC validation rejects the restore because the model-catalog-postgres PVC already exists. The exclude says “leave the platform stuff alone, restore only what was specific to my workload.”
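
If you restore more than a couple of namespaces, writing this manifest by hand gets repetitive. A minimal Python sketch that generates the same structure follows. The field names are copied from the ClusterRestore above, the helper names are hypothetical, and the output is JSON, which oc apply -f - accepts as readily as YAML.

```python
import json
from typing import Optional

FLAGS = {"skipIfAlreadyExists": True, "useOCPNamespaceUIDRange": True}


def component(namespace: str, excludes: Optional[list] = None) -> dict:
    """One in-place restore component; excludes is a list of (gvk, names) pairs."""
    config: dict = {"restoreFlags": dict(FLAGS)}
    if excludes:
        config["excludeResourceSelector"] = {"gvkSelector": [
            {"groupVersionKind": gvk, "objects": names} for gvk, names in excludes
        ]}
    return {
        "sourceNamespace": namespace,
        "restoreNamespace": namespace,
        "restoreConfig": config,
    }


def build_cluster_restore(name: str, backup: str, components: list) -> dict:
    return {
        "apiVersion": "triliovault.trilio.io/v1",
        "kind": "ClusterRestore",
        "metadata": {"name": name},
        "spec": {
            "source": {"type": "ClusterBackup", "clusterBackup": {"name": backup}},
            "components": components,
        },
    }


restore = build_cluster_restore(
    "rhoai-restore-pit",
    "rhoai-demo-backup-with-notes",
    [
        component("rhoai-demo"),
        component("rhoai-model-registries", excludes=[
            ({"kind": "PersistentVolumeClaim", "version": "v1"},
             ["model-catalog-postgres"]),
            ({"kind": "Deployment", "group": "apps", "version": "v1"},
             ["model-catalog", "model-catalog-postgres"]),
        ]),
    ],
)
print(json.dumps(restore, indent=2))
```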

I apply it and watch:

    
     $ oc get clusterrestore rhoai-restore-pit -w

    
   

Phases progress: PreClusterRestore → Restore (with Validation → PrimitiveMetadataRestore → DataRestore → MetadataRestore → Cleanup). The whole thing completes in under five minutes on this dataset. The DataRestore phase, the one that streams PVC content from the NFS target back into freshly provisioned ceph volumes, completes in 76 seconds for ~250 MiB of MariaDB content plus the workbench PVC.
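
In automation you would poll instead of watching. Here is a small Python sketch of that loop. It assumes the overall state is exposed at .status.status on the ClusterRestore and that the terminal values are Available and Failed; both are assumptions on my part, so check the CR on your cluster first. The function names are hypothetical.

```python
import subprocess
import time


def oc_status(name: str) -> str:
    """Read the restore status with oc.

    ASSUMPTION: the status string lives at .status.status on the CR.
    """
    result = subprocess.run(
        ["oc", "get", "clusterrestore", name, "-o", "jsonpath={.status.status}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_until_done(get_status, name: str, timeout_s: float = 900,
                    interval_s: float = 10, sleep=time.sleep) -> str:
    """Poll until the restore reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(name)
        if status in ("Available", "Failed"):  # ASSUMPTION: terminal states
            return status
        sleep(interval_s)
    raise TimeoutError(f"{name} did not finish within {timeout_s} seconds")
```

Calling wait_until_done(oc_status, "rhoai-restore-pit") blocks until the restore settles. The status getter and sleep function are injectable so the loop can be tested without a cluster.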

When the parent ClusterRestore flips to Available, I check the state:

    
     $ oc get notebook,inferenceservice,servingruntime,dspa,workflow -n rhoai-demo
NAME                                      AGE
notebook.kubeflow.org/rodolfo-workbench   2m

NAME                                         URL                                                READY
inferenceservice.serving.kserve.io/iris      http://iris-predictor.rhoai-demo.svc...            True
inferenceservice.serving.kserve.io/opt125m   http://opt125m-predictor.rhoai-demo.svc...         True

NAME                                                   DISABLED   MODELTYPE
servingruntime.serving.kserve.io/mlserver-runtime                 sklearn
servingruntime.serving.kserve.io/vllm-cpu-x86-runtime             vLLM

NAME                                                                                   AGE
datasciencepipelinesapplication.datasciencepipelinesapplications.opendatahub.io/dspa   2m

NAME                                                  PHASE       PROGRESS
workflow.argoproj.io/iris-classifier-training-cxxwr   Succeeded   5/5

    
   

The workbench is back. Both inference services are Ready. The DSPA is rebuilt with all seven pipeline pods. The pipeline run history is preserved.

Iris prediction, exactly as before:

    
     $ curl -s -X POST http://iris-predictor.rhoai-demo.svc.cluster.local:8080/v2/models/iris/infer ...
{"outputs":[{"data":[0,2]}]}

    
   

vLLM, generating again:

    
     $ curl -s -X POST http://opt125m-predictor.rhoai-demo.svc.cluster.local:8080/v1/completions ...
{"choices":[{"text":" 2023\nWith the effective implementation of digital transformation..."}]}

    
   

I open the workbench in the dashboard — and my notes from before the disaster are there, in the same notebook, with the same cell outputs. The Model Registry shows iris-classifier v1.0.0 still registered, lineage intact. The Pipelines tab shows iris-classifier-training with the historical iris-run-1 and its Succeeded status preserved.

That’s full state. Not “the cluster came back” — the application came back.

Things I learned the hard way

A short list of gotchas I hit during this build that are worth knowing if you’re going to try this on your own cluster.

  1. vLLM CPU eats memory. Default KV cache is roughly 31 GiB. Set VLLM_CPU_KVCACHE_SPACE=1 on the ServingRuntime, not the Deployment.

  2. The NooBaa OBC cascade-deletes the bucket. If you oc delete namespace, your S3 model artifacts go with it. Either do a surgical workload-only delete (like I did), or use an external S3 bucket whose lifecycle is independent of the namespace.

  3. The Model Registry webhook ties the CR's namespace to DSC. ModelRegistry instances must live in the namespace listed in DataScienceCluster.spec.components.modelregistry.registriesNamespace. JSON-patch transforms during a cross-namespace restore can’t bypass this. The fix for cluster-to-cluster migration is patching DSC on the destination before restore. For in-place restore on the same cluster, this isn’t an issue.

  4. The platform model-catalog-postgres PVC will block a restore. It’s DSC-managed, lives in the registries namespace, survives the delete, and conflicts with Trilio’s PVC validation when restore tries to recreate it. Exclude it via excludeResourceSelector.

  5. KServe predictor Service is headless. clusterIP: None means DNS resolves straight to pod IPs and the port 80 → targetPort 8080 mapping in the Service doesn’t apply. Always hit :8080 directly.

There’s a longer list — the workbench StatefulSet orphans, the DSPA port 8888 HTTPS trap, the Model Registry REST API requiring IDs in the body, the OpenShift namespace UID range mismatch — but those five are the ones that bit me hardest.

What this means for FSI RHOAI deployments

If you’re in financial services and you’re putting models into production on RHOAI, you’re going to be asked three questions by your risk team:

  1. Can you reproduce the exact state of a model that was in production when a specific decision was made?

  2. Can you recover the entire MLOps platform fast enough to meet your stated RTO?

  3. Can you survive a ransomware event in which everything in the cluster is encrypted?

Answering “yes” to all three requires backing up the workload, not just the pods. It requires preserving the model registry lineage, the pipeline run history, the workbench state, the inference service configurations, and the connective tissue (secrets, OBC bindings, ServingRuntimes) as a single consistent unit. And it requires being able to do that on a schedule that matches your business RPO — for most FSI customers, that means at least daily Fulls plus hourly Incrementals.
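
To get a feel for what that schedule costs in target storage, here is a rough estimate using the two sizes measured in this lab. Treat it as a sketch under strong assumptions: every Incremental is taken to be the same size as the one I measured, compression and deduplication on the target are ignored, and the 7-day retention is a hypothetical value I picked for illustration.

```python
full_bytes = 2750685184       # measured Full from this lab
incr_bytes = 25219072         # measured Incremental from this lab
incrementals_per_day = 23     # hourly restore points, with one hour taken by the Full
retention_days = 7            # hypothetical retention window

per_day = full_bytes + incrementals_per_day * incr_bytes
total = retention_days * per_day

print(f"Per day: {per_day / 1e9:.2f} GB")                        # Per day: 3.33 GB
print(f"{retention_days}-day retention: {total / 1e9:.1f} GB")   # 7-day retention: 23.3 GB
```

Real incrementals will vary with how much the workbench and databases change, so plug in your own measurements before sizing a target.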

Trilio for Kubernetes does all of that with one ClusterBackupPlan and one ClusterRestore per recovery event. The walkthrough above is the proof.

Summary

We built a real RHOAI 3.x DSProject from scratch: workbench with persistent storage, a Data Science Pipeline that actually ran, a Model Registry with a registered model, two KServe inference services (one classical, one generative on vLLM), and the S3 connection that ties them together. Then we protected the whole thing with one Trilio for Kubernetes backup plan, deleted the workload, and restored it in under five minutes — with notes in the workbench, lineage in the registry, run history in the pipelines, and predictions coming out of both models.

The point isn’t that Trilio for Kubernetes can take a backup. The point is that what gets backed up is the right thing — the AI/ML state your data science team and your auditors care about, captured atomically and recoverable as a single consistent unit.

Thank you for getting this far! I hope it was useful, and if you want to run the exact same walkthrough on your own cluster, you can.

Author

Rodolfo Casas

Rodolfo Casás is the Director of Product at Trilio with a special focus on cloud-native computing and virtualization, sovereign clouds, hybrid cloud strategies, telco, and data protection.

Copyright © 2026 by Trilio
