OpenShift is a Kubernetes-based platform that provides a consistent way to deploy, manage, and scale containerized applications across different environments. It extends Kubernetes with built-in tooling, automation, and enterprise features that simplify operations. OpenShift Virtualization Engine builds on this foundation by allowing virtual machines and containers to run side by side within the same cluster.

Modern environments often need to support both traditional virtual machines and cloud-native workloads. Running these on separate platforms increases operational overhead and complicates management. OpenShift Virtualization Engine addresses this issue by consolidating both workload types under a single control plane and operational model.

This article focuses on deploying and operating OpenShift Virtualization Engine on an existing cluster (not covering cluster provisioning). It walks through architecture, installation, storage configuration, and virtual machine lifecycle operations. The goal is to provide a practical, hands-on guide for running virtual machines reliably in production environments.

Summary of key best practices for optimizing OpenShift Virtualization Engine

| Best practice | Description |
|---|---|
| Validate hardware virtualization readiness | Confirm that CPU virtualization extensions are enabled to avoid runtime and scheduling failures. |
| Define storage for VM workloads | Isolate VM disks on appropriate storage classes to improve performance and reliability. |
| Perform conservative capacity planning | Allocate CPU and memory conservatively during initial deployment to maintain cluster stability. |
| Monitor storage provisioning | Track DataVolume and PVC status to detect import delays or provisioning issues early. |
| Validate lifecycle operations | Validate start, stop, restart, and migration workflows to ensure operational readiness. |

Infrastructure preparation

Before you start using OpenShift Virtualization Engine, confirm that your Red Hat OpenShift cluster and its worker hosts are ready.

Cluster and platform requirements

You will need the following:

  • A Red Hat OpenShift cluster (version 4.10 or later) with administrator access
  • Access to the OpenShift Web Console and a configured oc CLI
  • Deployment on a supported platform (bare metal, VMware vSphere, AWS, etc.)
  • Sufficient cluster capacity to support virtualization workloads

OpenShift Virtualization Engine builds on the standard OpenShift architecture. It enables virtual machines and containers to run on the same platform without introducing a separate management layer. The control plane manages cluster state and scheduling, while worker nodes execute both containerized and virtual machine workloads.

The OpenShift cluster architecture from the official Red Hat documentation (source)

Validate hardware virtualization on worker hosts

On each worker node, confirm that the CPU exposes virtualization extensions (vmx for Intel or svm for AMD) and that KVM kernel modules are loaded. Missing flags usually indicate that virtualization is disabled at the BIOS or hypervisor level. Without these capabilities, virtual machines will fail to start.

    
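     # CPU virtualization extensions: vmx (Intel VT-x) or svm (AMD-V)
     grep -E 'svm|vmx' /proc/cpuinfo
     # Loaded KVM kernel modules
     lsmod | grep kvm
     # NX (no-execute) CPU flag
     grep -w nx /proc/cpuinfo | head -n1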
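
When checking many nodes, it can help to reduce the raw flag dump to a single verdict per host. The following is a minimal sketch, assuming /proc/cpuinfo-style text arrives on stdin; the function name cpu_virt_support and its output labels are illustrative, not part of any OpenShift tooling.

```shell
# Classify CPU virtualization support from /proc/cpuinfo-style text on stdin.
# Prints one of: intel-vt-x, amd-v, none. (Hypothetical helper name.)
cpu_virt_support() (
  # Keep only the first "flags" line; cores normally report the same set.
  flags=$(grep -m1 -E '^flags[[:space:]]*:' || true)
  # Pad with spaces so each flag can be matched as a whole word.
  case " ${flags#*:} " in
    *" vmx "*) echo "intel-vt-x" ;;
    *" svm "*) echo "amd-v" ;;
    *)         echo "none" ;;
  esac
)
```

On a worker host this could be run as cpu_virt_support < /proc/cpuinfo; a result of none points to virtualization being disabled in firmware or not exposed by the hypervisor.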
    
   

Confirm that nodes are schedulable, and label VM workers

Check candidate workers to confirm that they are marked Ready and not cordoned. Nodes must be both Ready and schedulable. A node can report Ready while still being marked as SchedulingDisabled, which prevents workloads from being placed on it.

    
     oc get nodes -l node-role.kubernetes.io/worker -o wide
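
Because a cordoned node shows a combined status such as Ready,SchedulingDisabled, an exact match on the STATUS column separates usable nodes from cordoned ones. The sketch below assumes the default column layout of oc get nodes --no-headers (NAME, STATUS, ROLES, AGE, VERSION); the function name schedulable_nodes is hypothetical.

```shell
# Print the names of nodes that are Ready and not cordoned.
# Expects `oc get nodes --no-headers` output on stdin.
schedulable_nodes() {
  # A cordoned node reports "Ready,SchedulingDisabled", so require an exact match.
  awk '$2 == "Ready" { print $1 }'
}
```

Example usage: oc get nodes -l node-role.kubernetes.io/worker --no-headers | schedulable_nodes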
    
   

If needed, uncordon nodes to allow scheduling:

    
     oc adm uncordon <worker-node>
    
   

Label the worker nodes that will host virtual machines. Target that label with nodeSelector in VM manifests. If no nodes match the label, the scheduler cannot place the VM, and it remains in a Pending state.

    
     oc label node <worker-node> workload=virt --overwrite
     oc get nodes -l workload=virt --show-labels
    
   

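After labeling, the label appears in the node's metadata:

     apiVersion: v1
     kind: Node
     metadata:
       name: <worker-node>
       labels:
         workload: virt

KubeVirt itself uses the node label kubevirt.io/schedulable=true to mark nodes that are eligible for VMI scheduling, which is useful for cross-checking your own labels.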
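
To show where the nodeSelector belongs in a VM manifest, here is a small sketch that prints the relevant fragment. The helper name vm_node_selector_fragment and its three arguments are illustrative; the field path spec.template.spec.nodeSelector is the placement field of the VirtualMachine template.

```shell
# Print a VirtualMachine fragment that pins the VM to labeled nodes.
# $1 = VM name, $2 = node label key, $3 = node label value (all illustrative).
vm_node_selector_fragment() {
  cat <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: $1
spec:
  template:
    spec:
      nodeSelector:
        $2: "$3"
EOF
}
```

For example, vm_node_selector_fragment fedora-vm workload virt prints a fragment whose nodeSelector stanza can be copied into the VM manifest before it is applied.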

Quick resource planning notes

  • Start CPU allocation conservatively: OpenShift Virtualization uses a CPU allocation ratio (example: 10:1 vCPUs-to-pCPUs) set by vmiCPUAllocationRatio in the HyperConverged CR.
  • Assume VM memory is reserved by default: Budget for per-node and per-VM overhead, including about 218 MiB for the VM’s virt-launcher pod.
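
As a worked example of the allocation ratio, the sketch below computes the CPU request a VM pod would carry, assuming each vCPU requests 1/ratio of a physical core (so 100m per vCPU at the 10:1 example). The function name is hypothetical and the arithmetic uses integer division.

```shell
# CPU request in millicores for a VM pod, given a vCPU count and an
# allocation ratio. Assumes request = vCPUs * 1000 / ratio.
# $1 = vCPUs, $2 = ratio (defaults to 10).
vm_cpu_request_millicores() (
  vcpus=$1
  ratio=${2:-10}
  echo $(( vcpus * 1000 / ratio ))
)
```

With these assumptions, a 4-vCPU VM at a ratio of 10 requests 400 millicores, while the same VM at a ratio of 2 requests 2000 millicores.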

Verify readiness

Run three quick checks: list all nodes, list the nodes labeled for virtualization, and open a debug shell on a worker to confirm host access:

    
     oc get nodes
     oc get nodes -l workload=virt
     oc debug node/<worker-node>
    
   

Each node should report a Ready status, for example:

    
     NAME       STATUS   ROLES    AGE   VERSION
     worker-1   Ready    worker   10d   v1.xx
     worker-2   Ready    worker   10d   v1.xx
    
   

Installing OpenShift Virtualization Engine

With infrastructure preparation complete (CPU virtualization validated and worker nodes schedulable), installation is primarily an Operator Lifecycle Manager (OLM) workflow surfaced through OperatorHub. Use the stable channel and the mandated openshift-cnv namespace; Red Hat's documentation explicitly calls out installation in any other namespace as a failure condition.

Create the OperatorHub manifests (Namespace, OperatorGroup, and Subscription). The structure below matches Red Hat’s documented CLI flow; omit startingCSV if you do not want to pin a specific Operator version.

Save the following as SubscriptionFile.yaml:

    
     apiVersion: v1
     kind: Namespace
     metadata:
       name: openshift-cnv
       labels:
         openshift.io/cluster-monitoring: "true"
     ---
     apiVersion: operators.coreos.com/v1
     kind: OperatorGroup
     metadata:
       name: kubevirt-hyperconverged-group
       namespace: openshift-cnv
     spec:
       targetNamespaces: [openshift-cnv]
     ---
     apiVersion: operators.coreos.com/v1alpha1
     kind: Subscription
     metadata:
       name: hco-operatorhub
       namespace: openshift-cnv
     spec:
       source: redhat-operators
       sourceNamespace: openshift-marketplace
       name: kubevirt-hyperconverged
       channel: "stable"
       # startingCSV: kubevirt-hyperconverged-operator.vX.Y.Z
    
   

The YAML block above handles the prerequisite plumbing for the virtualization stack. It configures the dedicated namespace with monitoring enabled, establishes the required OperatorGroup for scoping, and, finally, subscribes to the stable channel to begin the automated deployment of the KubeVirt components. 

Apply with the oc CLI:

    
     oc apply -f SubscriptionFile.yaml
    
   

Create the minimal HyperConverged custom resource, which is the install trigger that the Operator waits for.

Save the following as HCOFile.yaml:

    
     apiVersion: hco.kubevirt.io/v1beta1
     kind: HyperConverged
     metadata:
       name: kubevirt-hyperconverged
       namespace: openshift-cnv
     spec:
    
   

Apply with the oc CLI:

    
     oc apply -f HCOFile.yaml
    
   

Readiness validation

Confirm that the ClusterServiceVersion (CSV) phase is Succeeded, that the HyperConverged resource (HCO) reports Available=True and Degraded=False, and that the core components (virt-operator, virt-api, virt-controller, virt-handler) are present and Ready.

Note that virt-launcher is per running VMI, so it typically appears only after you start a VM.
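
The readiness criteria above can be turned into a scripted gate. The sketch below is one possible approach, assuming the HyperConverged conditions are first flattened to one Type=Status pair per line with a jsonpath query; the function name hco_conditions_ok is hypothetical, and the oc commands in the trailing comments illustrate how the input might be produced.

```shell
# Succeed only when the conditions on stdin include Available=True and
# Degraded=False. Expects one "Type=Status" pair per line.
hco_conditions_ok() (
  conditions=$(cat)
  printf '%s\n' "$conditions" | grep -qx 'Available=True' &&
    printf '%s\n' "$conditions" | grep -qx 'Degraded=False'
)

# Possible usage against a live cluster (not executed here):
#   oc get csv -n openshift-cnv
#   oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | hco_conditions_ok && echo "HCO is ready"
#   oc get pods -n openshift-cnv
```

Keeping the check separate from the oc call makes it easy to reuse in a CI job or to exercise with captured output.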

Configuring storage and creating a virtual machine

This section uses example resource names for consistency. Adjust these values to match your environment and naming conventions:

  • Virtual machine name: fedora-vm
  • DataVolume name: fedora-dv
  • PersistentVolumeClaim (PVC) name: fedora-dv (CDI creates the PVC with the same name as the DataVolume)
  • Namespace/project: default (or current project)
  • StorageClass name: standard (or your available StorageClass)
  • VM disk image source: Fedora cloud image (HTTP/registry source)
  • virt-launcher pod name: Auto-generated (retrieved dynamically via oc get pod)

With installation complete and the core virtualization pods Ready, storage is the next gating dependency: VM disks are backed by PVCs, and CDI orchestrates imports into those PVCs. The workflow is to pick a dynamic StorageClass, import an OS disk into a PVC with a CDI DataVolume, attach it to a VirtualMachine, and then verify that the DataVolume is Succeeded and a VMI is running.

Select a StorageClass for VM disks

List StorageClasses and identify a default; if you want CDI to prefer a “virtualization default,” you can annotate one as the default virt class.

    
     oc get sc
     oc get sc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}'
     oc annotate sc <sc-name> storageclass.kubevirt.io/is-default-virt-class="true" --overwrite
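
The jsonpath query above prints one name and annotation value per line, separated by a tab. A small filter can pick out the default class or classes; this is a sketch with a hypothetical function name, assuming exactly that tab-separated layout.

```shell
# Print StorageClasses whose is-default-class annotation is "true".
# Expects "<name><TAB><annotation value>" lines on stdin.
default_storage_classes() {
  awk -F '\t' '$2 == "true" { print $1 }'
}
```

More than one line of output means several classes are marked as default, which is worth resolving before relying on default selection for VM disks.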
    
   

Import a container disk image into a PVC (DataVolume)

A DataVolume can import from a container registry (source.registry.url: docker://…) and transitions to Succeeded when the import completes.

Save the following as dv.yaml:

     apiVersion: cdi.kubevirt.io/v1beta1
     kind: DataVolume
     metadata:
       name: fedora-dv
     spec:
       source:
         registry:
           url: "docker://kubevirt/fedora-cloud-registry-disk-demo"
       storage:
         resources:
           requests:
             storage: 6Gi
    
   

Create and start a VM, then verify the runtime. Save the following as vm.yaml:

     apiVersion: kubevirt.io/v1
     kind: VirtualMachine
     metadata:
       name: fedora-vm
     spec:
       running: false
       template:
         spec:
           domain:
             resources: { requests: { memory: 2Gi } }
             devices:
               disks: [{name: rootdisk, disk: {bus: virtio}}]
           volumes:
           - name: rootdisk
             dataVolume: { name: fedora-dv }
    
   

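Apply the manifests with the oc CLI, start the VM, and watch the resources until the DataVolume reports Succeeded and the VMI is running:

     oc apply -f dv.yaml
     oc apply -f vm.yaml
     virtctl start fedora-vm
     oc get dv,pvc,vm,vmi -w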
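
For scripted runs, watching interactively can be replaced with a bounded polling loop. The sketch below is a generic poller under stated assumptions: the function name wait_for_value is hypothetical, and the oc invocations in the comments assume that the DataVolume and VMI expose their phase at .status.phase.

```shell
# Poll a command until its stdout equals the wanted value or attempts run out.
# $1 = wanted value, $2 = max attempts, $3 = seconds between attempts,
# remaining arguments = the command to run on each attempt.
wait_for_value() (
  want=$1; attempts=$2; delay=$3
  shift 3
  i=0
  got=""
  while [ "$i" -lt "$attempts" ]; do
    got=$("$@" 2>/dev/null) || got=""
    [ "$got" = "$want" ] && exit 0
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for '$want' (last value: '$got')" >&2
  exit 1
)

# Possible usage (not executed here):
#   wait_for_value Succeeded 60 5 oc get dv fedora-dv -o jsonpath='{.status.phase}'
#   wait_for_value Running   60 5 oc get vmi fedora-vm -o jsonpath='{.status.phase}'
```

Passing the command as arguments keeps the loop independent of any one resource type, so the same helper covers both the import and the VM start.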
    
   

To access the VM, use the serial console (virtctl console), VNC (through the web console or virtctl), or SSH provisioned through cloud-init. For example, open the console with virtctl console fedora-vm.

Monitoring and troubleshooting

After deployment, most issues fall into two categories: storage provisioning failures or scheduling and runtime errors. A consistent troubleshooting workflow helps isolate the failing layer before making configuration changes.

Run these first-line checks (workload namespace unless noted):

    
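     oc get vm,vmi -n <vm-namespace>
     oc describe vmi <vmi-name> -n <vm-namespace>
     oc get events -n <vm-namespace> --sort-by=.lastTimestamp | tail -n 20
     oc get pods -n <vm-namespace> -o wide | grep virt-launcher
     oc logs -n <vm-namespace> <virt-launcher-pod> --since=10m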
    
   

Cluster events often provide the root cause of failures, including scheduling issues, image import errors, or resource constraints.

For storage-related issues, inspect DataVolume Conditions (Bound/Running/Ready) and DV Events; this is the most direct view of CDI import/provisioning progress.

    
     oc get dv,pvc -n <vm-namespace>
     oc describe dv <dv-name> -n <vm-namespace>
     oc describe pvc <pvc-name> -n <vm-namespace>
    
   

For node/resource constraints, look for PodScheduled=False with an Unschedulable reason (often “Insufficient cpu/memory”), then confirm node conditions like MemoryPressure, DiskPressure, and PIDPressure.

    
     oc describe vmi <vmi-name> -n <vm-namespace> | sed -n '/Conditions:/,/Events:/p'
     oc describe node <worker-node> | grep -E 'Pressure|Allocated|Ready'
    
   

Common VM startup failures

| Issue | Likely cause | Recommended action |
|---|---|---|
| DataVolumeError / import failed | Invalid source URL, registry access issues, or CDI failure | Run oc describe dv … and correct the source URL, credentials, or registry access |
| PVC Pending / not Bound | StorageClass mismatch, provisioning issue, or insufficient capacity | Run oc describe pvc … and adjust StorageClass configuration or parameters |
| virt-launcher CrashLoopBackOff | Guest OS or hypervisor process failure | Check logs using oc logs --previous … and inspect pod details with oc describe pod … |
| Unschedulable / ImagePullBackOff | Insufficient node resources or incorrect image configuration | Adjust resource requests/limits or fix image reference and pull configuration |

For rollback and disaster recovery, OpenShift Virtualization supports VM snapshots and also documents VM backup/restore using OADP (Velero-based) with specific supported modes. For robust enterprise-grade backup and DR, organizations typically choose certified ecosystem solutions like Trilio, which provides policy-driven orchestration, application-consistent recovery, and automated multi-cluster replication.

Trilio’s Integrated Data Protection for OpenShift (source)

Recommendations and best practices

The following recommendations apply throughout the lifecycle of an OpenShift Virtualization deployment.

Validate hardware virtualization readiness

Begin by validating that every worker node meets the required CPU capabilities. Hardware virtualization extensions (Intel VT-x or AMD-V) and the NX bit must be enabled across all nodes.

Run these checks from your admin workstation (no SSH required) to validate flags and KVM module presence node-by-node:

    
     oc get nodes -l node-role.kubernetes.io/worker -o name
     for n in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
       echo "== ${n} =="; oc debug "${n}" -- chroot /host bash -c \
         'grep -m1 -E "vmx|svm" /proc/cpuinfo && grep -m1 -w nx /proc/cpuinfo && lsmod | grep -E "^kvm(_intel|_amd)?\b"'
     done
    
   

If vmx/svm or nx is missing, remediate in BIOS/UEFI (enable VT-x/AMD-V and NX/Execute Disable).  If workers are themselves virtualized, verify that your hypervisor exposes nested virtualization; otherwise, move VM workloads to compatible hosts.

Define storage for VM workloads

Storage configuration directly impacts VM reliability and performance. Define a dedicated virtualization StorageClass and ensure that it supports required capabilities such as snapshots, access modes, and volume types.

Practical baseline steps:

    
     oc get sc
     oc patch storageclass <storage_class_name> -p '{"metadata":{"annotations":{"storageclass.kubevirt.io/is-default-virt-class":"true"}}}'
     oc get storageprofile
    
   

If you need VM snapshots/backups, ensure that your CSI stack supports the Kubernetes Volume Snapshot API and that your chosen storage profile/snapshot class is configured; disks can be excluded when snapshot class selection is ambiguous.

For performance-oriented deployments on OpenShift Data Foundation (Ceph), Red Hat recommends RBD block-mode PVCs for VM disks (which offer better performance than CephFS or filesystem-mode RBD). Here are minimal DV/PVC storage settings to reflect that preference:

    
     spec:
       storage:
         storageClassName: <your-virt-default-sc>
         volumeMode: Block
         accessModes: ["ReadWriteOnce"]
    
   

If you plan live migration, design for shared storage with RWX where required.
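
A quick pre-check for the RWX requirement is to test a PVC's access modes before attempting a migration. This sketch assumes the modes are passed as a space-separated string, which is how a jsonpath query on .spec.accessModes[*] prints them; the function name is hypothetical, and RWX is only one of the migration prerequisites.

```shell
# Succeed when a space-separated access-mode list includes ReadWriteMany.
# $1 = access modes, e.g. "ReadWriteOnce" or "ReadWriteOnce ReadWriteMany".
pvc_is_rwx() {
  case " $1 " in
    *" ReadWriteMany "*) return 0 ;;
    *)                   return 1 ;;
  esac
}

# Possible usage (not executed here):
#   pvc_is_rwx "$(oc get pvc <pvc-name> -n <vm-namespace> -o jsonpath='{.spec.accessModes[*]}')" \
#     || echo "PVC is not RWX; live migration is unlikely to work"
```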

Perform conservative capacity planning

Avoid early oversubscription until you have workload traces. Red Hat documents explicit platform overhead: per worker node (~360 MiB RAM overhead) and per-VM overhead (including ~218 MiB for processes inside the virt-launcher pod, plus increments for vCPUs/devices). It also notes that worker nodes hosting VMs should reserve additional CPU capacity (baseline ~2 cores/node plus VM needs).
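
To make the memory overheads concrete, the sketch below estimates how many identical VMs fit on one worker by memory alone. It uses only the two fixed figures quoted above (about 360 MiB per node and about 218 MiB per VM) and ignores the per-vCPU and per-device increments, so the result is an optimistic upper bound. The function name is hypothetical.

```shell
# Upper bound on identical VMs per worker, by memory only.
# $1 = allocatable node memory in MiB, $2 = guest memory per VM in MiB.
max_vms_by_memory() (
  node_mib=$1
  vm_mib=$2
  node_overhead=360   # approximate per-node overhead (MiB)
  vm_overhead=218     # approximate fixed per-VM overhead (MiB)
  echo $(( (node_mib - node_overhead) / (vm_mib + vm_overhead) ))
)
```

For a worker with 65536 MiB allocatable and 4096 MiB guests, the calculation is (65536 - 360) / (4096 + 218), which rounds down to 15 VMs before any vCPU or device overhead is considered.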

Treat CPU overcommit as a deliberate policy knob via the HyperConverged vmiCPUAllocationRatio (default 10), which normalizes the VM pod’s CPU request based on vCPU count.  Start conservatively (default or lower) and only relax after observing contention.

    
     oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge \
       -p '{"spec":{"resourceRequirements":{"vmiCPUAllocationRatio":10}}}'
    
   

In VM specs, explicitly request memory to keep scheduling predictable.

    
     spec:
       template:
         spec:
           domain:
             resources:
               requests:
                 memory: 4Gi
    
   

If you enable live migration, also plan spare memory request capacity to absorb node drains (the docs provide a simple sizing heuristic based on “nodes draining in parallel × highest total VM memory requests”).
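
The sizing heuristic mentioned above is a single multiplication, shown here as a worked example. The function name and the GiB unit are illustrative choices.

```shell
# Spare memory request capacity to reserve for node drains.
# $1 = nodes drained in parallel, $2 = highest total VM memory requests
# on any single node (GiB).
migration_headroom_gib() {
  echo $(( $1 * $2 ))
}
```

For example, draining 2 nodes in parallel when the busiest node carries 48 GiB of VM memory requests suggests keeping 96 GiB of request capacity free across the remaining nodes.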

Monitor storage provisioning 

Most “VM won’t start” incidents are storage-path failures: the DataVolume never reaches Succeeded, the PVC never binds, or the importer pod fails. Red Hat recommends diagnosing DVs by inspecting DV Conditions (Bound, Running, Ready) and related DV events via oc describe dv. CDI also documents a WaitForFirstConsumer phase when the StorageClass uses that binding mode (often normal, but it changes what “stuck” looks like).

Commands worth standardizing (and alerting from, where possible):

    
     oc get dv,pvc -n <vm-namespace>
     oc describe dv <dv-name> -n <vm-namespace>
     oc get events -n <vm-namespace> --sort-by=.lastTimestamp | tail -n 30
     oc get pods -n <vm-namespace> | grep -E 'importer|cdi'


    
   

If DV import fails, inspect the importer pod logs and validate the source URL/registry credentials; events commonly surface HTTP errors (e.g., 404 on a misspelled import URL).  If PVCs sit at Pending, confirm that an appropriate default/virt-default StorageClass exists and that provisioning parameters match the provider.
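
For alerting, it can be useful to collapse DataVolume phases into a few triage buckets so that WaitForFirstConsumer is not mistaken for a stuck import. The following is a sketch only: the function and bucket names are hypothetical, and it assumes the phase strings Succeeded, Failed, and WaitForFirstConsumer, treating every other non-empty phase as still in progress.

```shell
# Map a DataVolume phase string to a coarse triage bucket.
# $1 = phase, e.g. from: oc get dv <dv-name> -o jsonpath='{.status.phase}'
dv_phase_bucket() {
  case "$1" in
    Succeeded)            echo "done" ;;
    Failed)               echo "failed" ;;
    WaitForFirstConsumer) echo "waiting-for-consumer" ;;
    "")                   echo "unknown" ;;
    *)                    echo "in-progress" ;;
  esac
}
```

A DataVolume that stays in the waiting-for-consumer bucket usually needs a VM or pod to consume the PVC before provisioning proceeds, whereas one that stays in-progress for a long time is the case worth alerting on.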

Validate lifecycle operations

Before rollout, validate the “Day 2” operations you’ll rely on during outages and upgrades: start/stop/restart, console access, migration (if supported), and snapshot/restore (plus backup tooling where mandated). virtctl is the supported CLI for common actions like start/stop/restart/migrate/console.

Minimal lifecycle test set (record timing + pass/fail criteria):

    
     virtctl start <vm_name>
     virtctl console <vmi_name>
     virtctl restart <vm_name>
     virtctl stop <vm_name>
     virtctl migrate <vm_name>   # only after confirming RWX + migration prerequisites
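
To record timing and pass/fail for each step, a small wrapper around the commands is enough. This is a minimal sketch with a hypothetical function name; it measures only how long the command itself takes to return, at one-second resolution.

```shell
# Run one lifecycle step and print "<label> <PASS|FAIL> <seconds>s".
# $1 = label, remaining arguments = the command to run.
timed_step() (
  label=$1
  shift
  start=$(date +%s)
  if "$@" >/dev/null 2>&1; then result=PASS; else result=FAIL; fi
  end=$(date +%s)
  echo "$label $result $((end - start))s"
)

# Possible usage (not executed here):
#   timed_step start   virtctl start <vm_name>
#   timed_step restart virtctl restart <vm_name>
#   timed_step stop    virtctl stop <vm_name>
```

Because these commands may return once the request is accepted rather than when the VM reaches its target state, pair each step with a status check if you need end-to-end timings.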
    
   

Live migration requires shared RWX storage plus adequate RAM/network bandwidth, and some configurations are explicitly non-migratable (e.g., RWO disks, passthrough devices like GPUs).

For snapshot/restore, validate both offline and online behavior, and install the QEMU guest agent for highest-integrity online snapshots.

    
     apiVersion: snapshot.kubevirt.io/v1beta1
     kind: VirtualMachineSnapshot
     metadata: { name: <snapshot-name> }
     spec:
       source: { apiGroup: kubevirt.io, kind: VirtualMachine, name: <vm-name> }
     ---
     apiVersion: snapshot.kubevirt.io/v1beta1
     kind: VirtualMachineRestore
     metadata: { name: <restore-name> }
     spec:
       target: { apiGroup: kubevirt.io, kind: VirtualMachine, name: <vm-name> }
       virtualMachineSnapshotName: <snapshot-name>
    
   

Quick remediation tip 

If migration or snapshot tests fail, first verify storage access mode / volume mode and snapshot class selection, then re-run tests after correcting the StorageClass/StorageProfile path.

Last thoughts

This article covered the deployment and operation of OpenShift Virtualization Engine, from infrastructure validation to virtual machine lifecycle management. It highlighted the importance of proper node preparation, storage configuration, and consistent monitoring practices to maintain stable VM workloads. 

For long-term operations, implementing a backup and recovery strategy using solutions such as Trilio strengthens resilience and ensures virtual machine data remains protected.
