Key Concepts and Best Practices for OpenStack Backup

According to the OpenStack 2023 User Report, OpenStack has seen remarkable growth, now boasting a global footprint of 45 million cores. This continued adoption highlights the critical need for robust backup strategies to protect virtual machines (VMs) and the entire OpenStack ecosystem.

Adopting OpenStack raises the question of how to back up a deployment effectively. This means not only backing up the data on VMs but also ensuring that OpenStack can withstand failure as an entire platform. Several use-case-specific configurations can be applied to each OpenStack service, and each service adds a layer of complexity that must be considered and protected because the services interact cooperatively to process computational workloads. For example, recovering the VMs from all your OpenStack projects is important, but if each machine’s associated floating IP addresses are lost, a significant amount of time and effort will be needed to return to the status quo.

This article discusses the importance of backup strategies in OpenStack environments and delves into key backup methods. We also explore the concept of comprehensive backups, ensuring that they extend beyond just protecting volumes.

Summary of key OpenStack backup concepts

| Concept | Description |
| --- | --- |
| The importance of comprehensive backups | Comprehensive backups are essential for holistic data protection in OpenStack. Beyond just volumes and disks, backups should encompass configuration files, databases, and even the underlying infrastructure to ensure full recovery in case of failures or disasters. |
| OpenStack built-in services | OpenStack provides built-in services such as Cinder for block storage snapshots and backups, often backed by storage systems such as Ceph. Leveraging these services along with best practices like periodic backups and testing can contribute to a robust data protection strategy. |
| Challenges with built-in services | OpenStack’s built-in services offer basic backup capabilities, but they have limitations. Cinder snapshots can create dependencies on the original volume, and recovering from a full environment failure can be complex. Additionally, managing backups across multiple services requires careful coordination and may not provide a unified view of the entire environment. |
| Strategic backup solutions | Trilio’s intelligent backup solution platform addresses the shortcomings of built-in services by reducing failure anxiety and improving deployment resilience. |

The importance of backups: “Hope is not a strategy”

Site reliability engineers repeat this phrase for a good reason: In any OpenStack deployment, the risk of failure will always remain. File corruption, natural events, or even broken water cooling pipes in your data center can spell disaster for your cloud and its users. Receiving a panicked phone call over the holidays is never ideal.

The question is: How can this risk be minimized? In an OpenStack environment, how can we harness the right tools to make a traditionally hectic failure scenario more manageable?

Large OpenStack implementations, such as the one at CERN, benefit from having multiple cloud regions that can fail over. To put this into perspective, CERN runs 300,000 cores, using 850.2 TiB (tebibytes) of memory and 14.6 PiB (pebibytes) of storage in OpenStack, with more than one availability zone.

CERN OpenStack resource usage (source)

Despite smaller deployments having impressive Ceph file systems that can withstand a number of their nodes failing, they are only partially protected from an on-site disaster that can result in permanent data (and reputation) loss. Whether it’s industry or academia, this is never a good look. Let’s explore what we can add to OpenStack to avoid this.

What does OpenStack provide out of the box?

OpenStack comes with a block storage service called Cinder that allows users to create block storage for their VMs, take snapshots of those volumes, and back up essential volumes on their hardware. In the event of a failure, the team can restore their machines’ data from a snapshot or from the backed-up volume.

The diagram below shows an example of an OpenStack deployment with the essential services for a cloud.

A typical OpenStack ecosystem (source)

What are Cinder’s limitations?

Cinder is a fantastic tool for creating backups or snapshots of your machines to ensure that you can plug the gap if a failure occurs. However, one of the main limitations of Cinder is the parent-child dependency when it comes to snapshots.

Taking a snapshot is excellent, but what happens when you lose or delete the snapshot’s original volume? Well, you’re between a rock and a hard place. Without the original volume, the snapshot cannot boot or restore effectively. Even creating a new machine based on the snapshot image will fail since it cannot locate the original volume. 

This can be overcome by promoting the snapshot to a backed-up volume, which removes the dependency on the original volume. From here, a custom image can be created and deployed, or the volumes can be kept for use in an emergency.
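The promotion step described above can be sketched as two OpenStack CLI calls: create a new volume from the snapshot, then back that volume up. The following is a minimal sketch that only builds the command lines and does not run them. The helper names and the `-restored` and `-backup` naming scheme are hypothetical, and the exact flags available may differ between client versions.

```python
from typing import List


def volume_from_snapshot_cmd(snapshot: str, new_volume: str) -> List[str]:
    """Build the CLI call that creates a new volume from an existing snapshot."""
    return ["openstack", "volume", "create", "--snapshot", snapshot, new_volume]


def backup_volume_cmd(volume: str, backup_name: str, incremental: bool = False) -> List[str]:
    """Build the CLI call that copies a volume to the Cinder backup store."""
    cmd = ["openstack", "volume", "backup", "create", "--name", backup_name]
    if incremental:
        cmd.append("--incremental")
    cmd.append(volume)
    return cmd


def promote_snapshot_plan(snapshot: str, prefix: str) -> List[List[str]]:
    """Return the ordered commands: snapshot -> new volume -> backup.

    The naming scheme (prefix + suffix) is an assumption made for illustration.
    """
    new_volume = f"{prefix}-restored"
    return [
        volume_from_snapshot_cmd(snapshot, new_volume),
        backup_volume_cmd(new_volume, f"{prefix}-backup"),
    ]


if __name__ == "__main__":
    # Print the plan for review instead of executing it.
    for cmd in promote_snapshot_plan("prod-1-snap", "prod-1"):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to review before handing each command to `subprocess.run` on a host with credentials loaded.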

Outside Cinder, how can I minimize a failure event?

To protect your infrastructure against failure, there is always the option to reduce the impact the failure will have. In OpenStack, if you’re limited to just OpenStack services, you can always take advantage of distributing your critical VMs across your hypervisors—OpenStack will naturally do this. However, if you have two critical workloads and notice that they are on the same hypervisor, for example, moving one of those workloads to another node via live migration would be a good idea to help reduce the impact of failure.

For those unfamiliar with the live migration feature in OpenStack, it allows you to move VMs to different hypervisors without downtime. Suppose the two critical workloads are Active Directory domain controllers placed on separate hypervisors: if one hypervisor fails but the other controller is accessible and has replication enabled, users can still log in and receive their group policy configurations, which means work can continue (albeit in a limited way).
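The placement check described above can be sketched in a few lines. This is a minimal sketch, assuming you already have a mapping of VM name to hypervisor (for example, gathered from the admin view of your instances). The function names and the host and VM names are hypothetical, and the live-migration command is only built, not executed.

```python
from collections import defaultdict
from typing import Dict, List


def colocated_critical_vms(placement: Dict[str, str], critical: List[str]) -> Dict[str, List[str]]:
    """Return hypervisors that host more than one critical VM."""
    by_host = defaultdict(list)
    for vm in critical:
        host = placement.get(vm)
        if host is not None:
            by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


def live_migration_cmds(collisions: Dict[str, List[str]]) -> List[List[str]]:
    """Suggest one live migration per extra critical VM on a shared host.

    The scheduler picks the target host; flag names may vary by client version.
    """
    cmds = []
    for host in sorted(collisions):
        for vm in collisions[host][1:]:  # keep the first VM where it is
            cmds.append(["openstack", "server", "migrate", "--live-migration", vm])
    return cmds


if __name__ == "__main__":
    placement = {"ad-dc-1": "compute-01", "ad-dc-2": "compute-01", "web-1": "compute-02"}
    collisions = colocated_critical_vms(placement, ["ad-dc-1", "ad-dc-2"])
    for cmd in live_migration_cmds(collisions):
        print(" ".join(cmd))
```

For a longer-term fix, a server group with an anti-affinity policy asks the scheduler to keep such VMs apart at creation time, which avoids the need for manual checks.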

Masakari is another option. This OpenStack service helps detect VM failure events and migrate the VMs to another node, which is especially useful for high-availability applications. Masakari uses its API service to manage and control the response to hypervisor failures.

What about backing up everything else?

As far as configurations go, outside the basic OpenStack services, no platform currently saves the configuration of the OpenStack cloud beyond storing it in a Git repository. The Keystone authentication service would require new users to be created, the Neutron service would need to be set up again to create your networks, and you would have to recreate your projects, their allocated subnets, custom flavors, images in Glance, and security groups. This can spell disaster, especially if you have many custom images, complex networks, and numerous VMs in your organization.
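One way to make the Git-repository approach more concrete is to export an inventory of these resources as JSON files that can be committed. The sketch below is an assumption-laden illustration: the `RESOURCES` mapping, function names, and file layout are hypothetical, and the `runner` parameter exists so the logic can be exercised without a live cloud. A listing like this is an inventory to rebuild from, not a restorable backup, since it does not capture image data, passwords, or per-resource details.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Dict, List

# Resource types named in the text, mapped to the CLI noun used to list them.
RESOURCES: Dict[str, List[str]] = {
    "projects": ["project"],
    "users": ["user"],
    "networks": ["network"],
    "subnets": ["subnet"],
    "flavors": ["flavor"],
    "images": ["image"],
    "security_groups": ["security", "group"],
}


def list_cmd(noun: List[str]) -> List[str]:
    """Build an `openstack <noun> list` call that emits JSON."""
    return ["openstack", *noun, "list", "-f", "json"]


def run_cli(cmd: List[str]) -> str:
    """Run a command and return its stdout; requires credentials in the environment."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def export_inventory(out_dir: Path, runner: Callable[[List[str]], str] = run_cli) -> List[Path]:
    """Write one sorted, indented JSON file per resource type into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, noun in RESOURCES.items():
        records = json.loads(runner(list_cmd(noun)))
        path = out_dir / f"{name}.json"
        # Stable key ordering keeps Git diffs small between exports.
        path.write_text(json.dumps(records, indent=2, sort_keys=True) + "\n")
        written.append(path)
    return written
```

Running `export_inventory(Path("inventory"))` on a schedule and committing the result gives a history of what existed, which shortens the rebuild even if the recreation itself is still manual.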

If a large distributed deployment encompasses more object-storage-specific services such as Swift, backing up the Swift configuration is also essential. File corruption or accidental deletion in the /etc/swift directory can make your cluster’s data inaccessible if it isn’t backed up. It is a best practice to spread this directory across your storage nodes, but this does not eliminate the risk of it being deleted altogether.
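A simple way to protect a configuration directory such as /etc/swift is to archive it regularly to a location outside the cluster. The following is a minimal sketch using only the Python standard library. The function names, the `swift-config` label, and the destination path are assumptions for illustration, and the archive should be copied off-site to be useful in a disaster.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def archive_config(src_dir: Path, dest_dir: Path, label: str = "swift-config") -> Path:
    """Write a timestamped .tar.gz of src_dir into dest_dir and return its path."""
    if not src_dir.is_dir():
        raise FileNotFoundError(f"{src_dir} is not a directory")
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest_dir / f"{label}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name so restores are predictable.
        tar.add(src_dir, arcname=src_dir.name)
    return archive


def list_archive(archive: Path) -> List[str]:
    """Return the sorted member names of an archive, for a quick sanity check."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(tar.getnames())


if __name__ == "__main__":
    # Hypothetical destination; reading /etc/swift typically requires elevated permissions.
    print(archive_config(Path("/etc/swift"), Path("/var/backups/swift")))
```

Listing the archive after writing it is a cheap verification step, in keeping with the advice to test backups rather than assume they work.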

More recent deployment methods, such as Kolla Ansible, can make the redeployment of OpenStack much smoother. Kolla Ansible deploys OpenStack via Ansible playbooks, meaning that the user just needs to run a few commands to get a working environment again. However, rebuilding this way after a catastrophic event would still involve a great deal of panicked work, because redeployment alone does not bring back the users, networks, and images described above.

Bridging the gap with Trilio

With so many moving parts, OpenStack’s distributed services architecture makes it challenging to perform a well-orchestrated backup of an entire cloud. Even though services exist that enable OpenStack to be backed up natively, the process is overly complex and requires a deep technical understanding of each service involved. To remedy this, the talented team at Trilio has developed a data protection as a service (DPaaS) product for OpenStack that simplifies the operation to a few clicks. Trilio started as the Raksha project in the early days of OpenStack and later evolved into its own service. Trilio can be deployed to many OpenStack distributions, such as Red Hat OpenStack Platform, Canonical OpenStack, Mirantis OpenStack Platform, Mirantis OpenStack on Kubernetes, and Kolla Ansible.

Trilio allows users to back up their critical workloads with complete autonomy. It enables the recovery of entire network topologies, projects, single VMs, or their respective files and folders. There is also the option to back up more discrete configurations, such as Swift configurations. Trilio integrates seamlessly into the Horizon UI and operates like an additional OpenStack service.

Backups within Trilio are defined as workloads, which can be a set of critical machines within an OpenStack project. A workload can include many features, such as a backup policy defining a specific type of backup—from complete to incremental—and a plan for when these backups occur and their frequency.
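To illustrate how a policy that mixes full and incremental backups plays out over time, here is a generic sketch. It is not Trilio’s API or implementation: the `BackupPolicy` class, its fields, and `restore_chain` are hypothetical names, and the model assumes the common scheme where every Nth run is a full backup and an incremental restore needs the most recent full backup plus the incrementals after it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class BackupPolicy:
    interval: timedelta  # time between backups
    full_every: int      # every Nth backup is full; the rest are incremental

    def __post_init__(self) -> None:
        if self.full_every < 1:
            raise ValueError("full_every must be at least 1")

    def backup_type(self, run_index: int) -> str:
        """Run 0 is always a full backup, then every full_every-th run after it."""
        return "full" if run_index % self.full_every == 0 else "incremental"

    def schedule(self, start: datetime, runs: int) -> List[Tuple[datetime, str]]:
        """Return (timestamp, type) pairs for the first `runs` backups."""
        return [(start + i * self.interval, self.backup_type(i)) for i in range(runs)]


def restore_chain(types: List[str], target: int) -> List[int]:
    """Indices needed to restore run `target`: the last full backup at or
    before it, plus every incremental up to and including the target."""
    start = max(i for i in range(target + 1) if types[i] == "full")
    return list(range(start, target + 1))


if __name__ == "__main__":
    policy = BackupPolicy(interval=timedelta(days=1), full_every=7)
    for when, kind in policy.schedule(datetime(2024, 1, 1), 9):
        print(when.date(), kind)
```

A shorter `full_every` keeps restore chains short at the cost of more storage, which is the main trade-off to weigh when choosing a policy.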

Trilio in action

To better understand how Trilio works, let’s walk through a demo. Below, you can see an example of a typical OpenStack user interface, with compute, volumes, network, and object stores but with an additional tab called Backups.

The modular nature of OpenStack and how Trilio has been developed allow seamless integration between Trilio and OpenStack. This means you do not have to deal with a whole different user interface or a time-consuming integration for your backup solution, as is the case with most vendors.

OpenStack Instances tab within Horizon

If we look within the Backups tab, we see workloads. As mentioned earlier, workloads are just backup jobs for one or more VMs. You can create and define your own unique workloads, which specify backup frequency, type, and other actions such as encryption. (Note that encrypted backups take slightly longer than non-encrypted ones.)

Trilio Workloads and workload types in Horizon

For this example, we can see the Production Workload details, including the name, description, availability zone, and the names and IDs of the VMs within this workload.

Trilio’s Workloads tab

Now, let’s see how Trilio works. For this example, we will remove two of our VMs and then restore those machines. This is typical of an accidental deletion scenario or something comparable.

The machines and their associated volumes are now being deleted in the image below. This setup does not have other features, such as soft delete, so once something is deleted, it goes into the void forever.

Deleting prod-1 and prod-2 VM instances in Trilio

The VMs and their associated volumes have now been removed.

Instances scheduled for deletion

With these instances deleted, we can open the Trilio backup tab within OpenStack and navigate back to our workload that backs up the VMs and their volumes.

The Trilio Workload interface

This is where some of Trilio’s intelligent recovery comes in. Within our Production workload, we open the Snapshots tab and select one of our incremental or full backup snapshots. These snapshots are defined within the workload policy.

Let’s recover our VMs and see what happens.

One-click restore of the previously deleted VMs

The one-click restore has been activated. Trilio will now begin to collate all the ingredients that the backup had from the metadata for those VMs and begin to rebuild the VMs.

VM information collated from Trilio and reprovisioning

The VMs, including all the IP addresses, flavors, key pairs, and machine instance names, are now being recovered. This is incredibly useful if you need to get back up and running quickly and represents a good form of insurance against accidental VM deletion.

In this example, we fully recovered our VMs by utilizing incremental backups stored within Trilio to provide a quick, intelligent recovery mechanism. Trilio also allows for more granular actions, such as restoring a specific file or folder. 

Recreating previously deleted instances with their selected flavor, keypair, and storage

Bonus: Network topology migration & restore in OpenStack via Trilio

Another impressive feature within Trilio is the ability to recover or migrate networks across different tenants. Creating networks and subnets and allocating IP addresses in OpenStack can be tedious, so using intelligent orchestration via Trilio to avoid time-consuming tasks such as this can be incredibly beneficial.

For this example, a new tenant called kevin-restore will be used to demonstrate the power and simplicity of migrating Trilio Workloads between tenants. Here, we can see that there are no existing instances or networks.

A blank network topology, only containing the external network

First, the specified workload needs access to the new kevin-restore tenant, but it is currently assigned to the previous kevin-demo tenant. Trilio can address this through the WorkloadManager CLI tool. This tool operates in the same fashion as the OpenStack SDK: It allows commands to be run that aren’t offered in the Horizon interface, such as reassigning a workload to another tenant.

Utilizing Trilio’s Workload Manager CLI tool to change ownership

Okay, now the workload’s ownership has changed to the kevin-restore tenant and is available within the Workloads section.

Changed ownership within the project

Great. From here, we can begin recovering our entire network topology. We can focus on the network topology recovery by utilizing the selective restore feature within Trilio.

Trilio’s Workloads tab

Starting a selective restore in Trilio

Restoring the network topology

Specifying the workload to restore the network topology

Once we have clicked Restore, the process will start. An excellent visual of this orchestrated recovery in action can be seen in the Network Topology graph in OpenStack. Here, we can see that our networks, subnets, and VMs are being recreated through a few simple clicks. The final result is the VMs being spun up with the same IP addresses, flavor, keypairs, and instance names that had been present in the other tenant.

Restoring the topology to the new project in real time

Completing migration of the network topology and instances

Conclusion

OpenStack has several useful tools for backup to minimize failure and enable recovery, providing insurance against unexpected deletion or failure. However, the need for intricate knowledge of Ceph, OpenStack, and its underlying services can leave some teams with significant technical debt or knowledge gaps. This is where Trilio shines: You don’t have to be an expert or wait in a vendor’s ticket queue when you have a severe incident. The team can perform an orchestrated backup and recovery in-house.

Trilio enables technical teams to be more autonomous. A few clicks within Trilio’s integrated backup tab are enough to kick off the restore necessary for the selected workload. This makes teams much more effective in a shorter amount of time during disaster recovery scenarios and keeps critical systems running with minimal downtime.

In a world where uptime and data integrity are paramount, leveraging tools like Trilio can provide a much-needed strategic advantage, allowing your organization to maintain resilience and reliability in the face of unexpected challenges.