
OpenShift, Red Hat’s distribution of Kubernetes, is a container orchestration platform designed to run distributed workloads within a cluster. This distribution of workloads across multiple systems and the ephemeral nature of a cloud-native system dictate how we should think about backups, data protection, and data restoration.

Regardless of size, every organization needs to consider how it will back up its data, protect it from loss or malware, and restore it when needed. This article focuses on backups, data protection, and the restoration of data and workloads on a Red Hat OpenShift cluster. It serves as a guide to best practices for backing up and securing your OpenShift workloads, so cluster operators can use those backups to migrate applications, set up testing environments, or recover from a disaster scenario.

As a housekeeping note, this article acknowledges the difference between restoring the cluster infrastructure to production readiness and restoring the cluster’s workloads or applications. We assume that the playbooks to restore cluster infrastructure are in place, and we dive into the data protection side.

Summary of key OpenShift backup best practices

| Best practice | Category | Description |
| --- | --- | --- |
| Create an application-centric backup process | Backup process | Create backups encompassing the entire application, including the namespace and any adjacent cluster resources, such as secrets and configmaps. |
| Use role-based access control to limit and audit access | Security and compliance | Control who can access your data and keep an audit trail using native RBAC controls. |
| Make backups immutable and encrypt them | Security and compliance | Keep your data protected by encrypting it upon backup and making it immutable in an off-site location. |
| Test data restoration | Restore process | Create backups often and test them regularly to ensure that they are running correctly. |
| Document the process | Documentation | Write process documentation that is easy to follow and can be used to help the team execute under pressure. |

Create an application-centric backup process

Our first best practice is to adopt an application-centric backup process. An application-centric backup focuses on backing up an entire application rather than individual files or databases from a server. It is an evolution from traditional, server-oriented backups in how we think about protecting data.

This process could be home-grown, or you could implement a platform to adopt this practice. For example, Trilio’s documentation describes how it approaches application-centric backups. It can automatically group applications based on labels, Helm charts, and operators to identify and back up any dependent resources.

Another thing to look for when managing backups through an application is the ability to automate the process. The key here is to mitigate the risk of human error interfering with the ability to create correct and complete backups.

OpenShift components: what to back up

When devising an application-centric backup strategy, it is important to understand the relationships between the resources and how they all apply to your application. You will want to make it easier on yourself by implementing a tool that can understand and handle the relationships between the various resources and storage volumes for your application and your cluster.

Let’s look at the resources that must be captured when making an application-centric backup.

Namespaces/labels

A namespace is a logical grouping of resources representing a single tenant or project in your cluster. Within a namespace, you may have many workload-related resources, such as deployments, routes, services, pods, and persistent volume claims (PVCs).

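Labels are another way to group resources, whether they live in a namespace or are cluster-scoped. A label is a key/value pair attached to a resource that can be used in selectors. Labels are distinct from annotations, which are also key/value pairs but cannot be used to select resources.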

The relationships between resources need to be understood. For example, for your application, you need to know which routes and services are connected to your deployment and what persistent storage claims are bound to your application.

For example, you may have a group of pods in a namespace labeled as the prod environment and a separate group of pods labeled as the dev environment. The following output shows all of the pods in a sample setup with their env labels.

$ oc get pods -L env
NAME                                    READY   STATUS    RESTARTS   AGE     ENV
nginx-deployment-dev-7f94978755-k9m6x   1/1     Running   0          5m27s   dev
nginx-deployment-dev-7f94978755-sct7n   1/1     Running   0          5m27s   dev
nginx-deployment-dev-7f94978755-vdb7h   1/1     Running   0          5m27s   dev
nginx-deployment-prod-dcbc57d57-bnb2r   1/1     Running   0          5m27s   prod
nginx-deployment-prod-dcbc57d57-hp8gp   1/1     Running   0          5m27s   prod
nginx-deployment-prod-dcbc57d57-nvvpl   1/1     Running   0          5m27s   prod

Deployments, routes, services, etc.

These pods are usually created indirectly: a deployment creates a replicaset, which in turn creates the pods. Because the deployment is the top-level definition, it is the resource you want to back up.

Custom resource definitions (CRDs) and metadata

Resources such as ConfigMaps and secrets are part of the application’s metadata and are used to keep the application code and configuration separate. Without these resources, a pod would not be able to run correctly.

Secrets provide information like account passwords or API keys, while ConfigMaps can provide less sensitive but still necessary information to the pods, such as environment variables or config files. 

Operators and Helm charts

Operators and Helm charts extend the functionality of your cluster. In OpenShift, these operators can also be installed via the OperatorHub. 

You will want to back up each operator and Helm chart by keeping track of the versions installed to ensure that when they are restored, the right version is installed.
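One way to keep track of installed versions is to record them alongside each backup. The sketch below assumes that the helm CLI is installed and that your operators are managed by Operator Lifecycle Manager, so that ClusterServiceVersion objects exist. The function name and the output file names are illustrative and not part of any tool.

```shell
# Record installed Helm releases and operator versions into a directory.
# record_versions is a hypothetical helper; the file names are illustrative.
record_versions() {
  outdir="$1"
  mkdir -p "$outdir" || return 1
  # Helm releases across all namespaces, including chart and app versions.
  helm list --all-namespaces --output yaml > "$outdir/helm-releases.yaml"
  # Operator versions, as ClusterServiceVersions (OLM-managed operators only).
  oc get clusterserviceversions --all-namespaces > "$outdir/operator-versions.txt"
}

# Usage (requires cluster access):
#   record_versions ./backup/versions
```

Storing these files next to the exported resources gives whoever performs the restore a record of which chart and operator versions to reinstall.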

Persistent volumes (PVs)

By default, pods have ephemeral storage, which means that any data a pod writes is lost when the pod is deleted or replaced. Persistent volumes are where data is stored past the life of an individual pod. Data such as that contained in databases is often stored in PVs, so it is important to flush data to disk before backing up a PV.

Images

Often overlooked, the specific images used for your application should also be backed up. If, for example, your application depends on a specific release/tag of an image, you should make sure you have a backup of that image just in case it is not available from the original repository.
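As a minimal sketch of this idea, an image can be copied to a registry you control with oc image mirror. The helper name and both image references are placeholders, and the sketch assumes that you are already authenticated to both registries.

```shell
# Copy a specific image tag to a registry you control.
# mirror_image is a hypothetical helper; both image references are placeholders.
mirror_image() {
  src="$1"   # for example: docker.io/library/nginx:1.25
  dst="$2"   # for example: registry.example.com/backup/nginx:1.25
  oc image mirror "$src" "$dst"
}

# Usage (requires registry access):
#   mirror_image docker.io/library/nginx:1.25 registry.example.com/backup/nginx:1.25
```

A mutable tag such as latest can point to different content over time, so a copy is most useful when the application references a fixed version tag or a digest.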

etcd database

The etcd database is a key-value store for OpenShift containing all the cluster state data. Backing this up is helpful if you must return the cluster to a specific state.

Step 1. Identify the application components

The first step is to identify all the resources the application depends on. The scope of this depends on the application, but you can often start with the resources contained within a namespace or those that share a label.

In the following example, we use the oc command line tool to look at resources. This tool is specific to OpenShift, but if you are familiar with kubectl, many of the commands correspond 1:1.

$ oc get all
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME                                         READY   STATUS    RESTARTS   AGE
pod/nginx-deployment-dev-796fc954bc-8v5xm    1/1     Running   0          2d
pod/nginx-deployment-dev-796fc954bc-cj5n6    1/1     Running   0          2d
pod/nginx-deployment-dev-796fc954bc-vdgzp    1/1     Running   0          2d
pod/nginx-deployment-prod-7884c86759-77t56   1/1     Running   0          2d
pod/nginx-deployment-prod-7884c86759-g6w2w   1/1     Running   0          2d
pod/nginx-deployment-prod-7884c86759-h99js   1/1     Running   0          2d

NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP                            PORT(S)   AGE
service/kubernetes      ClusterIP      172.30.0.1      <none>                                 443/TCP   28d
service/nginx-service   ClusterIP      172.30.19.214   <none>                                 80/TCP    3d21h
service/openshift       ExternalName   <none>          kubernetes.default.svc.cluster.local   <none>    28d

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx-deployment-dev    3/3     3            3           2d
deployment.apps/nginx-deployment-prod   3/3     3            3           2d

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/nginx-deployment-dev-796fc954bc    3         3         3       2d
replicaset.apps/nginx-deployment-prod-7884c86759   3         3         3       2d

If you look at the namespace, you may assume that oc get all, as shown above, would give you all of the resources you need to consider. It returns the most common resource types:

  • Pods
  • Services
  • Replication controllers
  • Deployments / DeploymentConfigs
  • ReplicaSets
  • StatefulSets
  • DaemonSets
  • Jobs
  • CronJobs
  • Routes

It does not, however, return other common resources that your application may depend on, such as these:

  • ConfigMaps
  • Secrets
  • PersistentVolumeClaims (PVCs)
  • PersistentVolumes (PVs)
  • Custom resource definitions (CRDs)

Using the api-resources subcommand (oc api-resources), you can list every resource type available in the cluster. The output is exhaustive, so you still have to determine which of those resources your application actually depends on. The first few rows look like this:

NAME                   SHORTNAMES         APIVERSION         NAMESPACED     KIND
bindings                                     v1               true         Binding
componentstatuses         cs                 v1               false        ComponentStatus
configmaps                cm                 v1               true         ConfigMap
endpoints                 ep                 v1               true         Endpoints
events                    ev                 v1               true         Event
limitranges               limits             v1               true         LimitRange
namespaces                ns                 v1               false        Namespace
nodes                     no                 v1               false        Node
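The two commands can be combined to list everything that lives in a namespace, including the types that oc get all leaves out. This is a minimal sketch and list_all_resources is a hypothetical helper name. It only covers namespaced types, so cluster-scoped resources such as PVs and CRDs still need to be checked separately.

```shell
# List every namespaced resource in a namespace, not only what "oc get all" shows.
# list_all_resources is a hypothetical helper name.
list_all_resources() {
  ns="$1"
  # Ask the API server for every namespaced type that supports "list",
  # then query each type in the target namespace.
  for kind in $(oc api-resources --verbs=list --namespaced -o name); do
    oc get "$kind" -n "$ns" --ignore-not-found -o name
  done
}

# Usage (requires cluster access):
#   list_all_resources default
```

The output can be long because it includes short-lived objects such as events, so expect to filter it down to the resources your application owns.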

In addition to namespace grouping, applications can be grouped based on a label, as shown below.

$ oc get deployments --show-labels
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE   LABELS
nginx-deployment-dev    3/3     3            3           2d    app=nginx,env=dev
nginx-deployment-prod   3/3     3            3           2d    app=nginx,env=prod

Labels can be used to group different versions of the application within the same namespace. For example, if you had a dev and prod version of the application running in the same namespace, you could choose to back up each version separately.

Step 2. Export the resources

Once you have identified the required resources, you need to export them to a file for backup. 

The most common file formats are YAML or JSON. 

You can use the oc command-line tool to export the resource to a YAML or JSON file, but the resulting file will contain a lot of unnecessary data, such as creation times or resource versions. 

To address this issue, the neat plugin (kubectl-neat, which is installed separately) adds a neat sub-command that removes the extra metadata and cleans up the output.

For example, these are the first ten deployment lines without using neat:

$ oc get deployment nginx-deployment-prod -o yaml | head -n10
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app":"nginx","env":"prod"},"name":"nginx-deployment-prod","namespace":"default"},"spec":{"replicas":3,"selector":{"matchLabels":{"app":"nginx"}},"template":{"metadata":{"labels":{"app":"nginx","env":"prod"}},"spec":{"containers":[{"env":[{"name":"MY_SECRET_KEY","valueFrom":{"secretKeyRef":{"key":"my-secret-key","name":"nginx-secret"}}}],"image":"nginx:latest","imagePullPolicy":"IfNotPresent","name":"nginx-prod","ports":[{"containerPort":80}],"volumeMounts":[{"mountPath":"/etc/nginx/default.conf","name":"config-volume","subPath":"default.conf"}]}],"volumes":[{"configMap":{"name":"nginx-config"},"name":"config-volume"}]}}}}
  creationTimestamp: "2024-05-17T21:12:37Z"
  generation: 2
  labels:

These are the first ten lines of the same deployment using neat:

$ oc neat get deployment nginx-deployment-prod -o yaml | head -n10
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    app: nginx
    env: prod
  name: nginx-deployment-prod
  namespace: default

However, doing this by hand means that this process must be repeated for each application resource, which often will not scale. Other options include keeping your resource files in a version control system like Git. If you store your resources this way and ensure that no additional changes have been made on the cluster side, this will also work as a backup method for these resources.
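If you do export by hand, a small loop at least removes the repetition. The sketch below writes one YAML file per resource kind for everything that matches a label selector. The helper name, the list of kinds, and the output layout are assumptions to adapt to your application.

```shell
# Export selected resource kinds for one application into per-kind YAML files.
# export_app is a hypothetical helper; adjust the kind list to your application.
export_app() {
  ns="$1"; selector="$2"; outdir="$3"
  mkdir -p "$outdir" || return 1
  for kind in deployments services routes configmaps secrets persistentvolumeclaims; do
    # One file per kind, limited to resources that carry the given labels.
    oc get "$kind" -n "$ns" -l "$selector" -o yaml > "$outdir/$kind.yaml"
  done
}

# Usage (requires cluster access):
#   export_app default app=nginx,env=prod ./backup/nginx-prod
```

Only resources that carry the selector labels are exported, so ConfigMaps and Secrets need the same labels as the deployment. Exported Secrets are base64-encoded rather than encrypted, so the resulting files should be handled as sensitive data.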

Step 3. Back up persistent data

Once the OpenShift resources have been backed up, the next step is to address any persistent data your application generates, such as databases and user-generated content.

Your persistent data should already be stored on a persistent volume. In OpenShift, CSI snapshots are a supported feature when using the provided CSI drivers. If you use a community-supported CSI driver, check the documentation to ensure that it supports snapshots.

You can create a snapshot by creating VolumeSnapshotClass and VolumeSnapshot objects. A VolumeSnapshotClass specifies the CSI driver and deletion policy used for snapshots, and a VolumeSnapshot requests a snapshot of the volume behind a persistent volume claim. Together, they create a snapshot of your data. The snapshot can be restored by creating a new persistent volume claim whose data source points to your snapshot.

Here are some examples of a VolumeSnapshotClass and VolumeSnapshot:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapshot-class
driver: csi-driver
deletionPolicy: Delete

(VolumeSnapshotClass)

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pvc-snapshot
spec:
  volumeSnapshotClassName: csi-snapshot-class
  source:
    persistentVolumeClaimName: my-pvc

(VolumeSnapshot)
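The restore side can be sketched in the same way. The helper below prints a PersistentVolumeClaim whose dataSource points at the VolumeSnapshot from the example above. The helper name, the claim name, and the size are placeholders. Because no storage class is set, the cluster default is used, and that class must be backed by the same CSI driver as the snapshot.

```shell
# Print a PVC manifest that restores data from an existing VolumeSnapshot.
# restore_pvc_manifest is a hypothetical helper; names and size are placeholders.
restore_pvc_manifest() {
  pvc="$1"; snapshot="$2"; size="$3"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $pvc
spec:
  dataSource:
    name: $snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $size
EOF
}

# Usage (requires cluster access):
#   restore_pvc_manifest my-pvc-restored pvc-snapshot 1Gi | oc apply -f -
```

The requested size should be at least as large as the volume the snapshot was taken from.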

If you are making a snapshot of database data, you must perform certain actions first, like flushing the database. The exact steps are dependent on your database, but most will require some steps to flush any cached data and pause any writes to the database before creating the snapshot. This is important because you do not want to make a backup of data midway through a write—that will create a backup that more than likely will not be usable.

Note that while these steps can be performed manually, any good backup tool will allow you to add “hooks” into the snapshot process that can automatically perform the predefined steps.
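The hook pattern itself is simple to sketch: quiesce, snapshot, then always resume. The helper below is hypothetical and takes the three commands as strings. What each command contains depends on your database and is not shown here.

```shell
# Run a snapshot between a quiesce step and a resume step.
# snapshot_with_hooks is a hypothetical helper; the caller supplies all three commands.
snapshot_with_hooks() {
  pre="$1"; snap="$2"; post="$3"
  # Stop if the database cannot be quiesced; a snapshot taken anyway may be inconsistent.
  sh -c "$pre" || { echo "pre-hook failed, skipping snapshot" >&2; return 1; }
  status=0
  sh -c "$snap" || status=$?
  # Always resume writes, even when the snapshot step failed.
  sh -c "$post" || echo "post-hook failed, database may still be paused" >&2
  return "$status"
}

# Usage with placeholder commands:
#   snapshot_with_hooks 'echo flush and pause' 'oc apply -f snapshot.yaml' 'echo resume'
```

The important design choice is that the resume step runs regardless of whether the snapshot succeeded, so a failed backup does not leave the database paused.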

For an in-depth look at the manual steps, refer to the OpenShift documentation, which provides full instructions on creating, restoring, and deleting snapshots.

Step 4. Back up etcd

The etcd database keeps track of the state of all resources in the cluster. Making a backup of the etcd database is a bit more involved than backing up other resources because you need to start a debug session on one of the control plane nodes and run a script that dumps a database backup.

One good example of where an etcd restore might come in handy is if you need to restore numerous deployments, but the teams that manage those deployments made changes via the UI and do not have the most up-to-date version of the YAML file. A drawback here is that restoring the etcd database could have unintended consequences and is usually a last resort, so this is also where a dedicated backup and restore tool like Trilio comes in handy.

As of OpenShift 4.15, the latest version at the time of writing, automated etcd backups are still a Technology Preview feature, which means they are not supported for production use and require extra steps to enable.

You can find full documentation on how to create etcd backups in Red Hat’s OpenShift Documentation. Trilio also provides an etcd plugin to help with the backup and restore process.
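For orientation, the manual procedure boils down to running the backup script that ships on the control plane nodes. The sketch below wraps that step in a hypothetical helper. The script path and target directory follow the procedure in Red Hat’s documentation, but verify them against the documentation for your OpenShift version before relying on them.

```shell
# Run the cluster backup script on one control plane node.
# backup_etcd is a hypothetical wrapper; the node name is a placeholder.
backup_etcd() {
  node="$1"
  # Start a debug pod on the node and run the script against the host filesystem.
  oc debug "node/$node" -- chroot /host \
    /usr/local/bin/cluster-backup.sh /home/core/assets/backup
}

# Usage (requires cluster-admin access):
#   backup_etcd <control-plane-node-name>
```

The backup files are written to the node itself, so they still need to be copied to storage outside the cluster.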

Security and compliance considerations

These are three primary considerations when it comes to security:

  • Use RBAC to limit/audit access: Institute security precautions regarding who can access your data. OpenShift has native role-based access control (RBAC) features that allow you to restrict access to your data and audit any actions performed on it. 
  • Make backups immutable: Backups should be immutable, meaning that they cannot be altered or deleted once created. This is a preventative measure against ransomware altering or encrypting your backups to hold your data hostage.
  • Encrypt your data: Use encryption keys to encrypt your data as it is being backed up. You should be able to use your own encryption keys as part of this process to ensure that only you can decrypt your backups.

These three considerations set the context for the following two best practices.

Use role-based access controls (RBAC) to limit and audit access

Role-based access control (RBAC) is a built-in user and group permissions system for OpenShift. It enforces policies that limit which users or services can access which resources. 

In OpenShift, there are two levels of roles: a local role, which is scoped to a single project (namespace), and a cluster role, which applies cluster-wide. Your approach to RBAC controls will depend on your organization’s needs. If you have multiple tenants that manage their own resources, you will want to create per-tenant users or groups to manage the resources. On the other hand, if you have a dedicated team managing your backups, you may want to create a cluster-wide role that has restricted access to all namespaces.

Here is an example of a role that defines what actions can be taken on what resources:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: dev-team-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]

When you create a role, you will need a RoleBinding to associate a user or group to a role:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-team-binding
  namespace: development
subjects:
- kind: Group
  name: dev-team
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dev-team-role
  apiGroup: rbac.authorization.k8s.io

In this example, these two files set up permissions so that members of the dev-team group can take the listed actions (the verbs get, list, watch, create, update, and delete) on the listed resources (pods, services, and configmaps) in the development namespace.
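A quick way to confirm that a binding behaves as intended is to ask the API server directly with oc auth can-i while impersonating a group member. The helper below is a hypothetical wrapper, and the user name in the usage line is a placeholder. Impersonation requires that your own account is allowed to impersonate users and groups.

```shell
# Check which verbs a group member may use on a resource in a namespace.
# check_access is a hypothetical helper; user and group names are placeholders.
check_access() {
  user="$1"; group="$2"; ns="$3"; resource="$4"; shift 4
  for verb in "$@"; do
    # "oc auth can-i" prints yes or no for the impersonated identity.
    answer="$(oc auth can-i "$verb" "$resource" -n "$ns" --as "$user" --as-group "$group")" || true
    echo "$verb $resource: $answer"
  done
}

# Usage (requires impersonation rights):
#   check_access alice dev-team development pods get delete
#   check_access alice dev-team development secrets get
```

Assuming no other bindings apply to the group, the second call should report no, because the example role does not list secrets.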

In addition to controlling access to your resources, you can also keep an audit trail of who does what within your cluster. Here is an example of an audit log entry that shows the system:admin user listed the available Trilio policies:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "e54527ce-02d1-426e-82e9-79f651dfa9da",
  "stage": "ResponseComplete",
  "requestURI": "/apis/triliovault.trilio.io/v1/policies?limit=500",
  "verb": "list",
  "user": {
    "username": "system:admin",
    "groups": [
      "system:masters",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "10.0.91.226"
  ],
  "userAgent": "oc/4.15.0 (linux/amd64) kubernetes/62c4d45",
  "objectRef": {
    "resource": "policies",
    "apiGroup": "triliovault.trilio.io",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2024-05-22T01:37:53.130266Z",
  "stageTimestamp": "2024-05-22T01:37:53.133197Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": ""
  }
}

Trilio creates RBAC roles that help manage the cluster at both the cluster and namespace levels. This ensures that each user can only see their own resources within Trilio’s interface. You can read more about how Trilio uses RBAC on its documentation page.

Make backups immutable and encrypt them

After controlling who has access to your data, you need to protect the data itself in case an unauthorized party gains access to it.

There are three steps to this best practice:

  1. Pick a storage target such as S3.
  2. Encrypt your backups as they are being made.
  3. Make the backup immutable.

The first step is to decide what off-site location you want to use to store your backups. You should pick a provider that can enforce immutability and ensure that your data cannot be modified or deleted once saved. Amazon S3 can be a good choice for this.

Second, you want to encrypt your backup data. The biggest concern here is encrypting sensitive information from your database. If you are creating your backups manually, you have options such as GnuPG to encrypt your files. This works but is clunky, and as part of a best practice, you will want to streamline your process.
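For the manual route, a minimal GnuPG sketch looks like the following. It uses symmetric AES256 encryption with a passphrase read from a file. The helper name and file paths are placeholders, and the sketch assumes GnuPG 2.1 or later for the loopback pinentry mode.

```shell
# Encrypt a backup archive with a symmetric key before it leaves the cluster.
# encrypt_backup is a hypothetical helper; the caller supplies the passphrase file.
encrypt_backup() {
  archive="$1"; passfile="$2"
  gpg --batch --yes --pinentry-mode loopback \
      --passphrase-file "$passfile" \
      --symmetric --cipher-algo AES256 \
      --output "$archive.gpg" "$archive"
}

# Usage with placeholder paths:
#   tar -czf nginx-prod.tar.gz ./backup/nginx-prod
#   encrypt_backup nginx-prod.tar.gz /path/to/passphrase-file
```

The passphrase file then becomes the thing to protect and to store separately from the backups, since the archive cannot be decrypted without it.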

With Trilio, you can enable encryption when you create the backup and use keys provided by the user, which are saved as secrets on your cluster. When you create a backup with Trilio, it makes a snapshot of a PV and converts it to the QCOW2 image format. Using a widely used format like QCOW2 makes the data easier to move and integrate into other processes. Trilio uses LUKS encryption, a common and well-tested technology, to protect the QCOW2 images.

The last part is to make your backup immutable. Trilio’s documentation provides more information about immutable backups.

One last, often overlooked aspect of this is data sovereignty. Depending on your location and its applicable compliance regulations, there may be restrictions on where your data can reside or under what conditions it needs to be kept. Consult any applicable regulatory websites for information on data protection and storage.

Test data restoration

Part of any backup plan is verifying that the data is good and can be restored quickly. This is achieved by continuously testing your backup process and having it well documented so it is easy to follow during an ongoing data incident.

Testing your restore process regularly lets you quickly identify issues such as incomplete backups, corrupted data, missing data, or any general oversight in the backup process.

Create a backup/test schedule

When you make backups and how long you keep them can be explored in an article all its own. In this section, we will scratch the surface.

You want to establish a process for restoring your data that reflects your backup type and schedule. How often and how you back up will determine your testing schedule.

A full backup is just what it sounds like: a backup of all your data. How long it takes depends on the amount of data you have to protect; it can strain system resources and take hours. Beyond the full backup, there are two other types of backups to consider when planning a schedule.

A backup tool is useful when you want to optimize how you do backups after a full backup.

After a full backup, your two best options are to make an incremental backup or a differential backup. Both an incremental and differential backup will take less time than running a full backup and take up less disk space but in different ways:

  • An incremental backup saves only the data that has changed since the last full or partial backup. It takes minimal space since it only saves the data that has changed since the last backup. However, restoring can be complex since you have to restore the full backup and each incremental backup. 
  • A differential backup saves the data that has changed since the last full backup. It takes more space because it always includes all of the data that has changed since the last full backup. Restoring is easier because you only need the last full backup and the most recent differential backup.
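The difference between the two restore paths is easiest to see with a small worked example. The helper below is hypothetical and only models the bookkeeping: given backup names in chronological order, it prints which ones are needed to restore the latest state. It assumes that names starting with "full" mark full backups.

```shell
# Print the backups needed to restore the most recent state, in restore order.
# restore_chain is a hypothetical helper. Arguments: the mode (incremental or
# differential), then backup names in chronological order. Names that start
# with "full" are treated as full backups.
restore_chain() (
  mode="$1"; shift
  full=""; since=""; last=""
  for b in "$@"; do
    case "$b" in
      full*) full="$b"; since=""; last="" ;;
      *) if [ -n "$full" ]; then since="$since $b"; last="$b"; fi ;;
    esac
  done
  [ -n "$full" ] || { echo "no full backup found" >&2; exit 1; }
  echo "$full"
  if [ "$mode" = incremental ]; then
    # Every incremental taken since the last full backup is required.
    for b in $since; do echo "$b"; done
  elif [ -n "$last" ]; then
    # Only the newest differential is required.
    echo "$last"
  fi
)

# Usage:
#   restore_chain incremental  full-sun mon tue wed   # prints full-sun, mon, tue, wed
#   restore_chain differential full-sun mon tue wed   # prints full-sun, wed
```

With a Sunday full backup and nightly backups through Wednesday, the incremental scheme needs four restore steps while the differential scheme needs two, which is the trade-off against the extra space that differentials consume.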

A common schedule might be to create full backups overnight on Sundays and then incremental or partial backups every night so that you are never too far away from doing a full recovery if needed.

If you are doing this manually, you will want to integrate your tooling into a cron job within OpenShift. The full documentation can be found in the OpenShift documentation for Jobs/CronJobs.
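As a rough sketch of what that integration could look like, the helper below prints a CronJob manifest for a given schedule. The helper name, the image, and the /scripts/backup.sh command are placeholders for whatever tooling you use. The pod would also need a service account with permission to read the resources it backs up, which is not shown here.

```shell
# Print a CronJob manifest that runs a backup command on a schedule.
# backup_cronjob_manifest is a hypothetical helper; image and command are placeholders.
backup_cronjob_manifest() {
  name="$1"; schedule="$2"; image="$3"
  cat <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: $name
spec:
  schedule: "$schedule"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: $image
              command: ["/bin/sh", "-c", "/scripts/backup.sh"]
EOF
}

# Usage (requires cluster access):
#   backup_cronjob_manifest nightly-backup "0 2 * * *" registry.example.com/tools/backup:1.0 | oc apply -f -
```

Setting concurrencyPolicy to Forbid keeps a new run from starting while the previous backup is still in progress.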

Trilio uses jobs and cron jobs as part of how it manages both scheduled backups and the retention policy for how long they are kept. You can explore their Backup Retention Policy docs for more information.

Knowing that your backup process works will build your team’s confidence, and you will know how long it would take to recover should you need to.

Continuous restores

One approach is to have a system that can continuously back up your data and then restore it to other clusters. This process not only benefits your cluster in the event of a disaster but also helps build test environments and a path from development to QA and production.

Continuous restore in action

With Trilio, you can implement the practice of continuous recovery, where your data is backed up and nearly instantly restored to another cluster. This process can reduce the time it takes to recover from a loss because you have a cluster ready to go. You can read more about continuous recovery on Trilio’s website.

Document the process

Finally, it is a best practice to document the entire process. This step is often overlooked because the person responsible for performing the restores is often also the only person responsible for documenting the process. Taking the time to document a process will often get deprioritized for more pressing work. 

For your documentation to be successful, someone other than its author must be able to follow it step by step. That individual must be able to perform a backup, restore a backup, and verify that the data is correct.

Here are some questions you’ll want to answer while writing your documentation.

| Documentation question | Description |
| --- | --- |
| What data is being backed up? | This could include which applications or tenants. |
| What is the backup schedule? | Weekly? Daily? Monthly? |
| What types of backups are you taking? | Full backups? Full and incremental? |
| How do you perform a backup? | Step through the backup process step by step. |
| How are backups verified? | Outline the steps needed to ensure the backup is correct. |
| Where are the backups stored? | Document where the backups should be stored. |
| How are backups secured? | List the ways that the backup data is protected. |
| How do you restore a backup? | The restore process is just as important as the backup process. |
| How long are backups kept? | It is important to know how far you can go back. |
| Who are the points of contact for escalation? | If need be, whom can you call? |

Conclusion

This article covered five best practices for backing up your OpenShift cluster and protecting your data. Let’s review the best practices once more so they are top of mind when you plan to back up your OpenShift cluster.

As the article demonstrated, creating an application-centric backup is a complex task. Still, it should be at the core of your backup strategy. It involves understanding the intricate relationships and dependencies among the many moving parts of your OpenShift resources. Remember that your ability to group your resources together via labels that make sense for your organization and to correctly back up your persistent data is key to this challenge.

Using role-based access controls (RBAC) to limit who can access your data can help keep it safe. Additionally, you can audit who has access to your data and resources should you ever need to provide that reporting.

Continuing with data protection, encrypting your data, making it immutable, and storing it off-site away from the cluster is vital. These practices can ensure your data remains secure in the event of deletion or ransomware.

Backing up the data is only half of the process. For a backup to be successful, you must be able to restore the data quickly when needed. Confidence will come from regular testing to ensure that the restores have the correct data.

Often, when you need to back up or restore data, you are in a stressful situation. This is why our final reminder is to document the whole process.

Trilio implements all of these best practices and can help you develop a complete plan for backing up and protecting your data within an OpenShift cluster.