A few weeks ago I was on a call with a financial services customer who had moved a credit-decisioning model into production on Red Hat OpenShift AI. They were happy with the platform. They were less happy with the answer they had for a question their risk officer had just asked:
“If an attacker encrypts the cluster tomorrow, what do we need to bring back to be inference-ready by Monday morning?”
The team started listing the obvious things — the model artifact, the serving endpoint. Then it got harder. The model registry that proves which version is in production. The pipeline run history the auditor had asked to see. The Jupyter workbenches their data scientists were actively using. The S3 credentials that connect all of it together.
It became clear pretty quickly that “back up the cluster” doesn’t mean what it used to. RHOAI is more than a model server — it’s a full MLOps platform, and each component has its own state. If you treat it like a stateless web app, the things regulators care most about are the things you lose first.
In this post I’m going to walk through what end-to-end protection of an RHOAI 3.x DSProject actually looks like with Trilio for Kubernetes. I’ll build a real workload, take a backup, blow the workload away, and bring it back from the backup — same notes in the workbench, same registered model, same pipeline run history, same predictions coming out of both a classical scikit-learn model and a vLLM-served generative model.
Our goal here is to show what a “good” Red Hat OpenShift AI protection plan covers — and to surface the gotchas I hit so you don’t have to.
Prerequisites
If you want to follow along on your own cluster, you’ll need:
- OpenShift 4.18 or later (I used 4.20)
- Red Hat OpenShift AI 3.3 or later, with the Dashboard, KServe, Data Science Pipelines, Workbenches, and Model Registry components enabled in the DataScienceCluster
- A CSI storage class that supports volume snapshots (I’m using OpenShift Data Foundation with ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs)
- NooBaa (ships with ODF) — or any S3-compatible object store you can point KServe at
- Trilio for Kubernetes 5.3 or later, installed in trilio-system, with a valid license and a healthy backup Target (mine is NFS-backed)
- oc CLI logged in as a cluster admin
A real lab — not a minikube. If you don’t have ODF, any storage class with CSI snapshots will work, but the storage class names in the examples below will need adjustment.
What we're going to build
Two namespaces:
- rhoai-demo — the application namespace. Hosts the workbench, the pipeline server, the KServe inference services, and the S3 data connection.
- rhoai-model-registries — where the RHOAI Model Registry instance lives. The Model Registry component in RHOAI 3.x binds every ModelRegistry instance to one specific namespace via the DataScienceCluster config, so this is a separate scope.
Inside those two namespaces, we’ll have:
- A Jupyter workbench (Notebook CR + StatefulSet + 20 GiB cephfs PVC)
- A Data Science Pipelines server (DSPA + 7 pods + MariaDB)
- A registered model (ModelRegistry instance + its MariaDB backend + REST endpoint)
- Two InferenceService resources: one classical (sklearn iris on MLServer), one generative (facebook/opt-125m on vLLM)
- An ObjectBucketClaim providing an S3 bucket on NooBaa, plus the data-connection Secret that ties KServe and DSPA to it
One ClusterBackupPlan in Trilio will cover both namespaces in a single consistent backup. That’s the plan.
Step 1 — Enabling OpenShift AI Model Registry and creating the project
By default, the Model Registry component in RHOAI is set to Removed in the DataScienceCluster. We need to turn it on:
oc patch datasciencecluster default-dsc --type=merge -p '{
"spec":{"components":{"modelregistry":{
"managementState":"Managed",
"registriesNamespace":"rhoai-model-registries"
}}}
}'
The Model Registry component is Removed by default in the DataScienceCluster for a couple of practical reasons. First, it only graduated from Tech Preview to GA in RHOAI 3.x — Red Hat keeps recently-promoted components opt-in for a release or two so customers consciously choose them rather than waking up to new operators after a routine upgrade. Second, the operator alone isn’t enough: a working Model Registry needs an external MySQL or MariaDB database that the DSC doesn’t provision for you. Defaulting to Removed avoids leaving an idle operator running in clusters where nobody has provided the backing database. Turning it on is a one-line patch, fully reversible, and only enables the operator — you still create the ModelRegistry instance and its database yourself in Step 4.
This takes a couple of minutes to reconcile. When it’s done, the component reports Ready:
$ oc get datasciencecluster default-dsc -o jsonpath='{.status.conditions[?(@.type=="ModelRegistryReady")].status}'
True
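If you would rather check readiness from a script than read the jsonpath output by eye, the same lookup can be sketched in Python. This is a minimal sketch: it assumes you feed it the JSON from `oc get datasciencecluster default-dsc -o json`, and the helper name `condition_status` is my own, not part of any RHOAI tooling.

```python
import json


def condition_status(dsc: dict, cond_type: str) -> str:
    """Return the status string of one condition, or "Unknown" if it is absent."""
    for cond in dsc.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status", "Unknown")
    return "Unknown"


# Hand-written, trimmed sample shaped like the DataScienceCluster status.
sample = json.loads(
    '{"status": {"conditions": [{"type": "ModelRegistryReady", "status": "True"}]}}'
)
print(condition_status(sample, "ModelRegistryReady"))
```

Returning "Unknown" for a missing condition keeps a polling loop simple: anything other than "True" means "keep waiting".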
The operator also auto-creates the rhoai-model-registries namespace. Good. Now I create the application namespace and label it so it shows up in the RHOAI dashboard:
oc create namespace rhoai-demo
oc label namespace rhoai-demo opendatahub.io/dashboard=true modelmesh-enabled=false --overwrite
oc annotate namespace rhoai-demo \
openshift.io/display-name="RHOAI Demo" \
openshift.io/description="Trilio + RHOAI demo project" --overwrite
If I refresh the RHOAI dashboard, RHOAI Demo shows up under Data Science Projects. Empty, but visible.
Step 2 — S3, the data connection, and why I'm using the internal NooBaa endpoint
KServe and Data Science Pipelines both need an S3 endpoint to read models and write artifacts. I want my models stored somewhere durable and separate from any single pod’s PVC, so I’m using NooBaa via an ObjectBucketClaim:
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: rhoai-demo-bucket
  namespace: rhoai-demo
spec:
  generateBucketName: rhoai-demo
  storageClassName: openshift-storage.noobaa.io
After applying, the OBC moves to Bound and NooBaa generates two artifacts in my namespace: a Secret with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and a ConfigMap with the bucket name and endpoint:
$ oc get obc -n rhoai-demo
NAME STORAGE-CLASS PHASE AGE
rhoai-demo-bucket openshift-storage.noobaa.io Bound 30s
Now I have to create an RHOAI-format data connection Secret that pulls those credentials together. Important detail: the AWS_S3_ENDPOINT here is the internal NooBaa service, not the external route.
apiVersion: v1
kind: Secret
metadata:
  name: aws-connection-rhoai-demo-bucket
  namespace: rhoai-demo
  labels:
    opendatahub.io/dashboard: "true"
    opendatahub.io/managed: "true"
  annotations:
    opendatahub.io/connection-type: s3
    openshift.io/display-name: rhoai-demo-bucket
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID:        # copy from the OBC-generated Secret
  AWS_SECRET_ACCESS_KEY:    # copy from the OBC-generated Secret
  AWS_DEFAULT_REGION: us-east-1
  AWS_S3_BUCKET:            # bucket name from the OBC-generated ConfigMap
  AWS_S3_ENDPOINT: https://s3.openshift-storage.svc
Why the internal endpoint and not the external route? I tried the external route first (s3-openshift-storage.apps.<cluster>...) and the DSPA pod immediately blew up with:
tls: failed to verify certificate: x509: certificate signed by unknown authority
The external NooBaa route is signed by the cluster’s default ingress cert, which is self-signed and not in any pod’s trust chain by default. The internal service endpoint uses the OpenShift Service CA, which every pod automatically trusts. Switch to the internal endpoint and the problem disappears.
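Pulling the pieces of this step together by hand is error-prone, so here is a minimal Python sketch of the assembly. It assumes the OBC-generated Secret carries base64-encoded `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` values and that the OBC-generated ConfigMap exposes the bucket name under `BUCKET_NAME` (check your own ConfigMap). The function name and the sample values are hypothetical.

```python
import base64
import json


def build_data_connection(obc_secret_data: dict, obc_configmap_data: dict,
                          name: str, namespace: str, display_name: str,
                          endpoint: str = "https://s3.openshift-storage.svc") -> dict:
    """Assemble an RHOAI-style S3 data connection Secret as a plain dict."""
    def decode(key: str) -> str:
        # Secret values come back base64-encoded from the Kubernetes API.
        return base64.b64decode(obc_secret_data[key]).decode()

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"opendatahub.io/dashboard": "true",
                       "opendatahub.io/managed": "true"},
            "annotations": {"opendatahub.io/connection-type": "s3",
                            "openshift.io/display-name": display_name},
        },
        "type": "Opaque",
        "stringData": {
            "AWS_ACCESS_KEY_ID": decode("AWS_ACCESS_KEY_ID"),
            "AWS_SECRET_ACCESS_KEY": decode("AWS_SECRET_ACCESS_KEY"),
            "AWS_DEFAULT_REGION": "us-east-1",
            "AWS_S3_BUCKET": obc_configmap_data["BUCKET_NAME"],
            # Internal service endpoint, not the external route.
            "AWS_S3_ENDPOINT": endpoint,
        },
    }


# Fake inputs, for illustration only.
fake_secret = {"AWS_ACCESS_KEY_ID": base64.b64encode(b"demo-key").decode(),
               "AWS_SECRET_ACCESS_KEY": base64.b64encode(b"demo-secret").decode()}
fake_configmap = {"BUCKET_NAME": "rhoai-demo-example"}
manifest = build_data_connection(fake_secret, fake_configmap,
                                 "aws-connection-rhoai-demo-bucket",
                                 "rhoai-demo", "rhoai-demo-bucket")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `oc apply -f -`, since the API accepts JSON manifests as well as YAML.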
Step 3 — Deploying the pipeline server and running a real pipeline
Now I deploy a Data Science Pipelines Application (DSPA). This will give me the apiserver, the workflow controller, the persistence agent, the scheduled-workflow controller, the metadata services, and a MariaDB to hold pipeline definitions and run history.
apiVersion: datasciencepipelinesapplications.opendatahub.io/v1
kind: DataSciencePipelinesApplication
metadata:
  name: dspa
  namespace: rhoai-demo
spec:
  apiServer:
    deploy: true
    enableSamplePipeline: false
  persistenceAgent:
    deploy: true
  scheduledWorkflow:
    deploy: true
  objectStorage:
    externalStorage:
      bucket:          # bucket name from the OBC-generated ConfigMap
      host: s3.openshift-storage.svc
      port: "443"
      scheme: https
      region: us-east-1
      secure: true
      s3CredentialsSecret:
        accessKey: AWS_ACCESS_KEY_ID
        secretKey: AWS_SECRET_ACCESS_KEY
        secretName: aws-connection-rhoai-demo-bucket
  database:
    disableHealthCheck: false
    mariaDB:
      deploy: true
      pipelineDBName: mlpipeline
      storageClassName: ocs-storagecluster-ceph-rbd
Couple of minutes later I have seven pods running:
$ oc get pods -n rhoai-demo
NAME READY STATUS RESTARTS AGE
ds-pipeline-dspa-697687986b-988tx 2/2 Running 0 2m
ds-pipeline-metadata-envoy-dspa-c5c7549d6-9tmjl 2/2 Running 0 2m
ds-pipeline-metadata-grpc-dspa-b59b76b68-r86zw 1/1 Running 0 2m
ds-pipeline-persistenceagent-dspa-dc758b4f7-6xnmv 1/1 Running 0 2m
ds-pipeline-scheduledworkflow-dspa-66788677-kbdld 1/1 Running 0 2m
ds-pipeline-workflow-controller-dspa-694d4b8c9d-sc6tc 1/1 Running 0 2m
mariadb-dspa-54fb668576-bkmc9 1/1 Running 0 3m
Now I want a real pipeline in there — not an empty server. I write a tiny KFP v2 pipeline that trains the iris dataset:
from kfp import dsl, compiler


@dsl.component(
    base_image="registry.access.redhat.com/ubi9/python-311:latest",
    packages_to_install=["scikit-learn==1.5.0", "joblib"]
)
def train_iris(accuracy_threshold: float = 0.9) -> str:
    # Imports live inside the function because KFP runs it in its own container.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(Xtr, ytr)
    return f"trained-acc-{clf.score(Xte, yte):.4f}"


@dsl.component(base_image="registry.access.redhat.com/ubi9/python-311:latest")
def register(model_info: str) -> str:
    return f"registered:{model_info}"


@dsl.pipeline(name="iris-classifier-training")
def iris_pipeline(accuracy_threshold: float = 0.9):
    t = train_iris(accuracy_threshold=accuracy_threshold)
    register(model_info=t.output)


compiler.Compiler().compile(iris_pipeline, "/tmp/iris_pipeline.yaml")
I run that inside a Kubernetes Job, then POST the resulting YAML to the DSPA apiserver via its multipart upload endpoint. Two things tripped me up here:
- DSPA’s port 8888 serves HTTPS, not HTTP, even though the port name is http. The podToPodTLS setting is on by default. Easy fix: use https:// and curl -k.
- The DSPA NetworkPolicy only allows ingress on 8888 from pods labeled opendatahub.io/workbenches: "true", plus a few other DSPA component selectors. My Job pod needed that label, or the connection hangs.
With those two fixes, the upload and run trigger work:
$ oc get workflow -n rhoai-demo
NAME PHASE PROGRESS
iris-classifier-training-cxxwr Succeeded 5/5
The run shows up in the RHOAI Pipelines tab, complete with the parameters I passed and the step-by-step status.
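For anyone scripting the upload step rather than using curl, here is a minimal standard-library sketch of the multipart POST. It targets the Kubeflow Pipelines v2beta1 upload endpoint with the `uploadfile` form field. The Service hostname `ds-pipeline-dspa.rhoai-demo.svc` is an assumption based on the pod names above, and whether a bearer token is needed depends on your setup, so the token is optional here. The sketch only builds the request; sending it is left as a commented line.

```python
import ssl
import urllib.parse
import urllib.request
import uuid

# Assumption: the apiserver Service is named ds-pipeline-dspa in rhoai-demo.
DSPA_URL = "https://ds-pipeline-dspa.rhoai-demo.svc:8888"


def build_upload_request(pipeline_yaml: bytes, name: str, token: str = "",
                         base_url: str = DSPA_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to the pipeline upload endpoint."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="uploadfile"; filename="pipeline.yaml"\r\n',
        b"Content-Type: application/x-yaml\r\n\r\n",
        pipeline_yaml,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    url = f"{base_url}/apis/v2beta1/pipelines/upload?name={urllib.parse.quote(name)}"
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def curl_k_context() -> ssl.SSLContext:
    """TLS context that skips verification, the Python equivalent of `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before verify_mode is relaxed
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


req = build_upload_request(b"# compiled pipeline YAML goes here\n",
                           "iris-classifier-training")
print(req.get_method(), req.full_url)
# To send it from a pod that the DSPA NetworkPolicy admits:
# urllib.request.urlopen(req, context=curl_k_context())
```

Skipping verification mirrors what I did with curl in the lab; for anything longer-lived you would load the proper CA bundle instead.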
Step 4 — Model Registry: MariaDB backend and registering a model via the API
The Model Registry component in RHOAI 3.x installs the operator and the Model Catalog. It does not create a ModelRegistry instance for you, and the registry server doesn’t ship with its own database — you bring your own MySQL/MariaDB.
I deploy a small MariaDB, then create the ModelRegistry CR that points at it:
apiVersion: modelregistry.opendatahub.io/v1beta1
kind: ModelRegistry
metadata:
  name: rhoai-registry
  namespace: rhoai-model-registries
spec:
  grpc: {}
  rest: {}
  oauthProxy: {}
  mysql:
    host: model-registry-db
    port: 3306
    database: model_registry
    username: mlmduser
    passwordSecret:
      key: database-password
      name: model-registry-db
    skipDBCreation: false
The operator reconciles this into a Deployment, Service, Route, and OAuth proxy in front of the REST and gRPC ports. I want to register the iris model I trained in step 3, so I need to talk to the registry’s API.
The external route is OAuth-protected and meant for human use. For machine-to-machine registration I can hit the registry pod’s :8080 directly — but only after relaxing the NetworkPolicy that the operator installs:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-to-registry-8080
  namespace: rhoai-model-registries
spec:
  podSelector:
    matchLabels:
      app: rhoai-registry
      app.kubernetes.io/name: rhoai-registry
      component: model-registry
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: {}
      ports:
        - {protocol: TCP, port: 8080}
Plus a tiny additional Service that targets port 8080 (the registry’s default Service only exposes 8443):
apiVersion: v1
kind: Service
metadata:
  name: rhoai-registry-internal
  namespace: rhoai-model-registries
spec:
  type: ClusterIP
  ports:
    - {port: 8080, targetPort: 8080, name: http-api}
  selector:
    app: rhoai-registry
    app.kubernetes.io/name: rhoai-registry
    component: model-registry
Now I can POST to the registry. The Kubeflow Model Registry REST API has a small quirk worth noting: child-resource IDs go in the body, not just the URL path:
# Create the registered model
curl -s -X POST http://rhoai-registry-internal.rhoai-model-registries.svc:8080/api/model_registry/v1alpha3/registered_models \
-H "Content-Type: application/json" \
-d '{"name":"iris-classifier","description":"RandomForest iris classifier for the Trilio demo","owner":"rodolfo","state":"LIVE"}'
# Create a version (registeredModelId in body)
curl -s -X POST http://.../api/model_registry/v1alpha3/model_versions \
-d '{"registeredModelId":"1","name":"v1.0.0","author":"rodolfo","state":"LIVE"}'
# Create the artifact pointer (modelVersionId in body)
curl -s -X POST http://.../api/model_registry/v1alpha3/model_versions/2/artifacts \
-d '{"modelVersionId":"2","name":"iris-model-artifact","modelFormatName":"sklearn","storageKey":"aws-connection-rhoai-demo-bucket","storagePath":"models/iris","uri":"s3:///models/iris","artifactType":"model-artifact","state":"LIVE"}'
I open the registry in the RHOAI dashboard and there’s my iris-classifier v1.0.0, pointing at the S3 path where the actual model lives.
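The curl calls above hard-code the IDs "1" and "2", which only holds on an empty registry. A minimal Python sketch of the same three calls, chaining the `id` returned by each response into the next request body, might look like this. It assumes each POST returns a JSON object with an `id` field; the function names are my own, and the `post` parameter exists only so the flow can be exercised without a live registry.

```python
import json
import urllib.request

BASE = ("http://rhoai-registry-internal.rhoai-model-registries.svc:8080"
        "/api/model_registry/v1alpha3")


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def register(name: str, version: str, artifact: dict,
             post=post_json, base: str = BASE) -> tuple:
    """Create model, version, and artifact; parent IDs go in the body."""
    model = post(f"{base}/registered_models", {"name": name, "state": "LIVE"})
    mv = post(f"{base}/model_versions",
              {"registeredModelId": model["id"], "name": version, "state": "LIVE"})
    art = post(f"{base}/model_versions/{mv['id']}/artifacts",
               {"modelVersionId": mv["id"], **artifact})
    return model["id"], mv["id"], art["id"]


# Dry run with a stand-in for the HTTP layer, so no registry is needed.
calls = []


def fake_post(url: str, payload: dict) -> dict:
    calls.append((url, payload))
    return {"id": str(len(calls))}


ids = register("iris-classifier", "v1.0.0",
               {"name": "iris-model-artifact", "modelFormatName": "sklearn",
                "artifactType": "model-artifact", "state": "LIVE"},
               post=fake_post, base="http://registry.example")
print(ids)
```

Against the real service you would call `register(...)` without the `post` and `base` overrides, from a pod that the NetworkPolicy above admits.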
Step 5 — KServe: serving classical and generative in the same namespace
Two ServingRuntime resources, two InferenceService resources. Same namespace.
For the classical model I’m using MLServer:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: mlserver-runtime
  namespace: rhoai-demo
spec:
  containers:
    - name: kserve-container
      image: registry.redhat.io/rhoai/odh-mlserver-rhel9@sha256:bd44d...
      env:
        - {name: MLSERVER_MODEL_NAME, value: '{{.Name}}'}
        - {name: MLSERVER_HTTP_PORT, value: "8080"}
        - {name: MODELS_DIR, value: /mnt/models}
      ports: [{containerPort: 8080, protocol: TCP}]
  supportedModelFormats:
    - {name: sklearn, version: "1", autoSelect: true}
For the generative model I’m using vLLM, specifically the x86 CPU build that ships in the RHOAI catalog. And this is where I lost an hour the first time around. With the default vLLM environment, the CPU runtime tries to allocate roughly 31 GiB for its KV cache, which is fine if you have that kind of memory budget — and not fine if you don’t. My pods got OOMKilled within 30 seconds, every time.
The fix is one environment variable on the ServingRuntime:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cpu-x86-runtime
  namespace: rhoai-demo
spec:
  containers:
    - name: kserve-container
      image: registry.redhat.io/rhaiis/vllm-cpu-rhel9@sha256:f05e7...
      command: [python, -m, vllm.entrypoints.openai.api_server]
      args:
        - --port=8080
        - --model=/mnt/models
        - --served-model-name={{.Name}}
        - --dtype=float32
        - --max-model-len=512
      env:
        - {name: HF_HOME, value: /tmp/hf_home}
        - {name: VLLM_CPU_KVCACHE_SPACE, value: "1"}  # critical
      resources:
        requests: {cpu: "1", memory: 5Gi}
        limits: {cpu: "2", memory: 8Gi}
VLLM_CPU_KVCACHE_SPACE=1 caps the KV cache at 1 GiB, which is plenty for facebook/opt-125m and any other small model you’d put on CPU for a lab. Important detail — set this on the ServingRuntime, not directly on the rendered Deployment. KServe re-renders the Deployment from the runtime spec, so any direct patch gets clobbered on the next reconcile.
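To see why 1 GiB is plenty, here is a back-of-the-envelope calculation in Python. It assumes the published OPT-125M shape (12 decoder layers, hidden size 768) and 4 bytes per value to match `--dtype=float32`. It ignores vLLM's block-based allocation and other overheads, so treat the result as a rough upper bound rather than what the runtime will report.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, hidden_size: int, dtype_bytes: int) -> int:
    """One key vector and one value vector per layer, each hidden_size wide."""
    return 2 * num_layers * hidden_size * dtype_bytes


# Assumed model shape: OPT-125M, 12 layers x 768 hidden, float32.
per_token = kv_bytes_per_token(num_layers=12, hidden_size=768, dtype_bytes=4)
tokens_in_cache = (1 * GIB) // per_token       # VLLM_CPU_KVCACHE_SPACE=1
full_length_sequences = tokens_in_cache // 512  # --max-model-len=512

print(per_token, tokens_in_cache, full_length_sequences)
# prints: 73728 14563 28
```

So a 1 GiB cache holds roughly 14,500 tokens, or about 28 sequences at the full 512-token length, which is far more concurrency than a lab CPU pod will serve anyway.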
I preload facebook/opt-125m into the S3 bucket via a Job that snapshots from HuggingFace and uploads to NooBaa, then create the two InferenceService resources:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: iris
  namespace: rhoai-demo
spec:
  predictor:
    minReplicas: 1
    model:
      modelFormat: {name: sklearn, version: "1"}
      runtime: mlserver-runtime
      storage:
        key: aws-connection-rhoai-demo-bucket
        path: models/iris
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: opt125m
  namespace: rhoai-demo
spec:
  predictor:
    minReplicas: 1
    model:
      modelFormat: {name: vLLM}
      runtime: vllm-cpu-x86-runtime
      storage:
        key: aws-connection-rhoai-demo-bucket
        path: models/opt-125m
A couple of minutes later, both are serving. Quick test against iris:
$ curl -s -X POST http://iris-predictor.rhoai-demo.svc.cluster.local:8080/v2/models/iris/infer \
-H "Content-Type: application/json" \
-d '{"inputs":[{"name":"predict","shape":[2,4],"datatype":"FP32","data":[[5.1,3.5,1.4,0.2],[6.7,3.1,5.6,2.4]]}]}'
{"model_name":"iris","model_version":"v1.0.0","outputs":[{"name":"predict","shape":[2,1],"datatype":"INT64","data":[0,2]}]}
Class 0 (setosa) and class 2 (virginica). Correct.
And vLLM:
$ curl -s -X POST http://opt125m-predictor.rhoai-demo.svc.cluster.local:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"opt125m","prompt":"Trilio protects RHOAI by","max_tokens":20}'
{"id":"cmpl-912b...","model":"opt125m","choices":[{"text":" providing the best data collection, quality, and scalability for RHOAI workloads..."}]}
A small model hallucinates a small answer, but the plumbing works.
Important note on the headless service: if you try curl iris-predictor.<ns>.svc:80/... and get connection refused — that’s expected. KServe RawDeployment mode uses a headless Service (clusterIP: None), so DNS resolves directly to pod IPs and port mapping (80 → 8080) doesn’t apply. Always hit :8080 directly.
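Because the port is easy to get wrong, it can help to build the predictor URL and the V2 inference payload in one place. The helpers below are a minimal sketch with hypothetical names; they only construct the URL and body shown in the curl examples above and do not send anything.

```python
import json


def predictor_url(isvc: str, namespace: str, path: str, port: int = 8080) -> str:
    """Cluster-internal predictor URL; the port is always explicit (headless Service)."""
    return f"http://{isvc}-predictor.{namespace}.svc.cluster.local:{port}{path}"


def v2_infer_payload(rows: list, name: str = "predict") -> dict:
    """Wrap a list of equal-length feature rows as a single FP32 input tensor."""
    return {"inputs": [{"name": name,
                        "shape": [len(rows), len(rows[0])],
                        "datatype": "FP32",
                        "data": rows}]}


url = predictor_url("iris", "rhoai-demo", "/v2/models/iris/infer")
payload = v2_infer_payload([[5.1, 3.5, 1.4, 0.2], [6.7, 3.1, 5.6, 2.4]])
print(url)
print(json.dumps(payload))
```

Deriving `shape` from the rows avoids the mismatch you get when you add a sample to the data and forget to update the shape by hand.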
Step 6 — Creating a Trilio backup
Now for the part this whole post is about. I create one ClusterBackupPlan that covers both namespaces in one consistent backup:
apiVersion: triliovault.trilio.io/v1
kind: ClusterBackupPlan
metadata:
  name: rhoai-demo-plan
spec:
  backupConfig:
    target:
      name: tvk-target
      namespace: trilio-system
  backupComponents:
    - namespace: rhoai-demo
    - namespace: rhoai-model-registries
Then trigger a Full backup:
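oc apply -f - <<EOF
apiVersion: triliovault.trilio.io/v1
kind: ClusterBackup
metadata:
  name: rhoai-demo-backup-baseline
spec:
  clusterBackupPlan:
    name: rhoai-demo-plan
  type: Full
EOF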
About six minutes later:
$ oc get clusterbackup
NAME TYPE STATUS DATA SIZE DURATION
rhoai-demo-backup-baseline Full Available 2750685184 6m22s
Because Trilio runs inside the cluster, backups are driven through CRDs and oc apply. There is also a standalone UI, plus an OpenShift console plugin if you would rather stay in the OCP console or just want to check quickly that everything looks good.
About 2.75 GB (roughly 2.56 GiB) of consistent, restorable state across the two namespaces. That includes the workbench’s PVC content, the MariaDB databases for both DSPA and the model registry, every CR (workbench Notebook, DSPA, InferenceService, ServingRuntime, ModelRegistry, OBC, Secrets, NetworkPolicies, etc.), and even the container image references for an air-gapped restore scenario.
Now I open the workbench in JupyterLab, add a few cells of notes (“this is the state I want to be able to come back to”), save the notebook, and trigger an Incremental:
apiVersion: triliovault.trilio.io/v1
kind: ClusterBackup
metadata:
  name: rhoai-demo-backup-with-notes
spec:
  clusterBackupPlan:
    name: rhoai-demo-plan
  type: Incremental
$ oc get clusterbackup
NAME TYPE STATUS DATA SIZE
rhoai-demo-backup-baseline Full Available 2750685184
rhoai-demo-backup-with-notes Incremental Available 25219072
25 MB. That’s the delta — the changed blocks in the workbench PVC plus the few CR updates that happened since the Full. Two restore points, two business decisions: “back to before I added the notes” or “back to right now.”
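The DATA SIZE column is in raw bytes, so a few lines of Python make the two restore points easier to compare. This sketch uses only the two byte counts from the output above; the helper name `human` is my own.

```python
def human(num_bytes: int) -> str:
    """Render a byte count in both decimal (GB) and binary (GiB) units."""
    return f"{num_bytes / 1e9:.2f} GB ({num_bytes / 2**30:.2f} GiB)"


full_bytes = 2750685184  # rhoai-demo-backup-baseline
incr_bytes = 25219072    # rhoai-demo-backup-with-notes

print(human(full_bytes))
print(f"{incr_bytes / 1e6:.1f} MB, {100 * incr_bytes / full_bytes:.2f}% of the Full")
# prints:
# 2.75 GB (2.56 GiB)
# 25.2 MB, 0.92% of the Full
```

In other words, the second restore point cost less than one percent of the first.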
Step 7 — The disaster
I want to show this is a real recovery, not a no-op. So I’m going to delete everything that defines the workload — keeping the namespaces themselves and the OBC alive (more on why in a moment):
oc delete -n rhoai-demo notebook,datasciencepipelinesapplication,inferenceservice,servingruntime,workflow,statefulset --all
oc delete pvc -n rhoai-demo --all
oc delete -n rhoai-model-registries modelregistry rhoai-registry
oc delete -n rhoai-model-registries deployment/model-registry-db svc/model-registry-db pvc/model-registry-db
After that:
$ oc get notebook,inferenceservice,servingruntime,dspa,pvc -n rhoai-demo
No resources found in rhoai-demo namespace.
$ oc get modelregistry.modelregistry.opendatahub.io -n rhoai-model-registries
No resources found in rhoai-model-registries namespace.
The workbench is gone. The Pipelines tab in the RHOAI dashboard says “No pipelines deployed.” The Models tab is empty. The Model Registry section says no registries available.
Why I didn’t oc delete namespace: the ObjectBucketClaim has a default reclaim policy of Delete. If I delete the namespace, the OBC cascades to NooBaa, NooBaa deletes the bucket, and the iris and opt-125m model files in S3 are gone with it. The restore would bring back the InferenceService CRs but they’d fail to load because the artifacts they reference would not exist. A surgical workload delete keeps the bucket alive and the model files intact. For real production DR you’d want external S3 that’s outside the namespace lifecycle entirely — that’s a separate conversation.
Step 8 — Restore in under five minutes
One ClusterRestore brings it all back:
apiVersion: triliovault.trilio.io/v1
kind: ClusterRestore
metadata:
  name: rhoai-restore-pit
spec:
  source:
    type: ClusterBackup
    clusterBackup:
      name: rhoai-demo-backup-with-notes
  components:
    - sourceNamespace: rhoai-demo
      restoreNamespace: rhoai-demo
      restoreConfig:
        restoreFlags:
          skipIfAlreadyExists: true
          useOCPNamespaceUIDRange: true
    - sourceNamespace: rhoai-model-registries
      restoreNamespace: rhoai-model-registries
      restoreConfig:
        restoreFlags:
          skipIfAlreadyExists: true
          useOCPNamespaceUIDRange: true
        excludeResourceSelector:
          gvkSelector:
            - groupVersionKind:
                kind: PersistentVolumeClaim
                version: v1
              objects: [model-catalog-postgres]
            - groupVersionKind:
                kind: Deployment
                group: apps
                version: v1
              objects: [model-catalog, model-catalog-postgres]
Two flags and one exclude rule worth explaining:
- skipIfAlreadyExists: true. OpenShift and RHOAI auto-inject a bunch of resources into every namespace (default ServiceAccounts, RBAC, the trusted-CA-bundle ConfigMap). They survive the delete and will be present when restore runs. This flag tells Trilio to leave them alone instead of choking on the conflict.
- useOCPNamespaceUIDRange: true. OpenShift’s restricted-v2 SCC assigns container UIDs from a per-namespace range. PVC content gets written with those UIDs. Without this flag, restored pods can hit permission errors trying to read their own PVCs.
- excludeResourceSelector on the registries namespace. The platform-level Model Catalog (model-catalog Deployment + model-catalog-postgres Deployment and PVC) is managed by the DSC and lives in rhoai-model-registries for organizational reasons. It survives the workload delete because we don’t touch it, but the backup snapshot captured it. Without the exclude, Trilio’s PVC validation rejects the restore because the model-catalog-postgres PVC already exists. The exclude says “leave the platform stuff alone, restore only what was specific to my workload.”
I apply it and watch:
$ oc get clusterrestore rhoai-restore-pit -w
Phases progress: PreClusterRestore → Restore (with Validation → PrimitiveMetadataRestore → DataRestore → MetadataRestore → Cleanup). The whole thing completes in under five minutes on this dataset. The DataRestore phase — the one that streams PVC content from the NFS target back into freshly provisioned ceph volumes — completes in 76 seconds for ~250 MiB of MariaDB content plus the workbench PVC.
When the parent ClusterRestore flips to Available, I check the state:
$ oc get notebook,inferenceservice,servingruntime,dspa,workflow -n rhoai-demo
NAME AGE
notebook.kubeflow.org/rodolfo-workbench 2m
NAME URL READY
inferenceservice.serving.kserve.io/iris http://iris-predictor.rhoai-demo.svc... True
inferenceservice.serving.kserve.io/opt125m http://opt125m-predictor.rhoai-demo.svc... True
NAME DISABLED MODELTYPE
servingruntime.serving.kserve.io/mlserver-runtime sklearn
servingruntime.serving.kserve.io/vllm-cpu-x86-runtime vLLM
NAME AGE
datasciencepipelinesapplication.datasciencepipelinesapplications.opendatahub.io/dspa 2m
NAME PHASE PROGRESS
workflow.argoproj.io/iris-classifier-training-cxxwr Succeeded 5/5
The workbench is back. Both inference services are Ready. The DSPA is rebuilt with all seven pipeline pods. The pipeline run history is preserved.
Iris prediction, exactly as before:
$ curl -s -X POST http://iris-predictor.rhoai-demo.svc.cluster.local:8080/v2/models/iris/infer ...
{"outputs":[{"data":[0,2]}]}
vLLM, generating again:
$ curl -s -X POST http://opt125m-predictor.rhoai-demo.svc.cluster.local:8080/v1/completions ...
{"choices":[{"text":" 2023\nWith the effective implementation of digital transformation..."}]}
I open the workbench in the dashboard — and my notes from before the disaster are there, in the same notebook, with the same cell outputs. The Model Registry shows iris-classifier v1.0.0 still registered, lineage intact. The Pipelines tab shows iris-classifier-training with the historical iris-run-1 and its Succeeded status preserved.
That’s full state. Not “the cluster came back” — the application came back.
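If you want the "same predictions" claim to be a check rather than an eyeball comparison, a small validator over the V2 response does the job. This is a minimal sketch with a hypothetical function name; the two sample responses are the ones shown before the disaster and after the restore.

```python
def iris_response_matches(response: dict, expected=(0, 2)) -> bool:
    """True if the first output tensor carries exactly the expected class IDs."""
    outputs = response.get("outputs", [])
    return bool(outputs) and list(outputs[0].get("data", [])) == list(expected)


before = {"model_name": "iris", "model_version": "v1.0.0",
          "outputs": [{"name": "predict", "shape": [2, 1],
                       "datatype": "INT64", "data": [0, 2]}]}
after = {"outputs": [{"data": [0, 2]}]}

print(iris_response_matches(before), iris_response_matches(after))
# prints: True True
```

Only the deterministic sklearn model is compared this way; the vLLM completion is sampled, so its text is expected to differ between calls.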
Things I learned the hard way
A short list of gotchas I hit during this build that are worth knowing if you’re going to try this on your own cluster.
- vLLM CPU eats memory. The default KV cache is roughly 31 GiB. Set VLLM_CPU_KVCACHE_SPACE=1 on the ServingRuntime, not the Deployment.
- The NooBaa OBC cascade-deletes the bucket. If you oc delete namespace, your S3 model artifacts go with it. Either do a surgical workload-only delete (like I did), or use an external S3 bucket whose lifecycle is independent of the namespace.
- The Model Registry webhook ties the CR’s namespace to the DSC. ModelRegistry instances must live in the namespace listed in DataScienceCluster.spec.components.modelregistry.registriesNamespace. JSON-patch transforms during a cross-namespace restore can’t bypass this. The fix for cluster-to-cluster migration is patching the DSC on the destination before restore. For in-place restore on the same cluster, this isn’t an issue.
- The platform model-catalog-postgres PVC will block a restore. It’s DSC-managed, lives in the registries namespace, survives the delete, and conflicts with Trilio’s PVC validation when restore tries to recreate it. Exclude it via excludeResourceSelector.
- The KServe predictor Service is headless. clusterIP: None means DNS resolves straight to pod IPs and the port 80 → targetPort 8080 mapping in the Service doesn’t apply. Always hit :8080 directly.
There’s a longer list — the workbench StatefulSet orphans, the DSPA port 8888 HTTPS trap, the Model Registry REST API requiring IDs in the body, the OpenShift namespace UID range mismatch — but those five are the ones that bit me hardest.
What this means for FSI RHOAI deployments
If you’re in financial services and you’re putting models into production on RHOAI, you’re going to be asked three questions by your risk team:
-
Can you reproduce the exact state of a model that was in production when a specific decision was made?
-
Can you recover the entire MLOps platform fast enough to meet your stated RTO?
-
Can you survive a ransomware event in which everything in the cluster is encrypted?
Answering “yes” to all three requires backing up the workload, not just the pods. It requires preserving the model registry lineage, the pipeline run history, the workbench state, the inference service configurations, and the connective tissue (secrets, OBC bindings, ServingRuntimes) as a single consistent unit. And it requires being able to do that on a schedule that matches your business RPO — for most FSI customers, that means at least daily Fulls plus hourly Incrementals.
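To put rough numbers on "daily Fulls plus hourly Incrementals", here is a small Python estimate. It is a sketch under a loud assumption: every Full and every Incremental is the same size as the two backups measured in this lab, which will not hold for a busy production project. It also assumes the Full takes one of the 24 hourly slots.

```python
FULL_BYTES = 2_750_685_184  # rhoai-demo-backup-baseline, from this lab
INCR_BYTES = 25_219_072     # rhoai-demo-backup-with-notes, from this lab


def daily_plan(fulls_per_day: int = 1, incrementals_per_day: int = 23) -> dict:
    """Restore points, worst-case gap between them, and bytes written per day."""
    points = fulls_per_day + incrementals_per_day
    return {
        "restore_points_per_day": points,
        "worst_case_rpo_minutes": (24 * 60) // points,
        "bytes_per_day": fulls_per_day * FULL_BYTES + incrementals_per_day * INCR_BYTES,
    }


plan = daily_plan()
print(plan["restore_points_per_day"], plan["worst_case_rpo_minutes"],
      f"{plan['bytes_per_day'] / 1e9:.2f} GB")
# prints: 24 60 3.33 GB
```

The useful part is the shape of the trade-off: the Incrementals add many restore points for a small fraction of the storage a second Full would cost.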
Trilio for Kubernetes does all of that with one ClusterBackupPlan and one ClusterRestore per recovery event. The walkthrough above is the proof.
Summary
We built a real RHOAI 3.x DSProject from scratch: workbench with persistent storage, a Data Science Pipeline that actually ran, a Model Registry with a registered model, two KServe inference services (one classical, one generative on vLLM), and the S3 connection that ties them together. Then we protected the whole thing with one Trilio for Kubernetes backup plan, deleted the workload, and restored it in under five minutes — with notes in the workbench, lineage in the registry, run history in the pipelines, and predictions coming out of both models.
The point isn’t that Trilio for Kubernetes can take a backup. The point is that what gets backed up is the right thing — the AI/ML state your data science team and your auditors care about, captured atomically and recoverable as a single consistent unit.
Thank you for getting this far! I hope it was useful, and if you want to run the exact same walkthrough on your own cluster, you can.