OpenStack clusters, like any other complex system, are vulnerable to a wide range of disaster scenarios, from a single-node outage to a complete site outage. Since every production-scale OpenStack environment has built-in redundancy for its services, machines, and cabling, how well you recover usually depends on how you plan that redundancy and the procedures you follow when disaster strikes.
The typical assessment questions are:
- How many controllers are in your cluster?
- How many nodes per availability zone (AZ) are in your design?
- Which data should you back up, and how frequently?
- Do you have documentation and training for your disaster recovery program?
This article explains core concepts and recommended practices for designing control and data plane redundancy, making effective backups of critical data and configuration, and, finally, not overlooking the process and people aspects of a disaster recovery program.
Summary of key OpenStack disaster recovery core concepts and best practices
| Core concept / best practice area | Description |
|---|---|
| Disaster scenarios and their potential risks | A variety of disaster scenarios can arise in OpenStack, including service, controller, compute node, network, and instance failures. |
| Controller redundancy | Implement redundant controllers (VMs or containers), distribute services evenly, and maintain base image backups for control plane resilience. |
| Compute node redundancy | Mitigate the risk of compute node failures by ensuring sufficient redundancy with spare nodes. For critical workloads that cannot tolerate downtime, consider implementing instance HA policies to automatically recover and restart instances on healthy nodes. |
| Network redundancy | Use multiple subnets and NICs to segregate control plane traffic and enhance network redundancy and security. |
| Base image backups | Ensure consistent recovery and rapid deployment of overcloud controllers by maintaining backups of base images, configurations, and snapshots. |
| OpenStack instance recovery | Leverage Heat orchestration templates and regular backups to streamline OpenStack instance recovery and ensure data protection in case of failures. |
| Creating a disaster recovery program | Encompass not only technical solutions but also well-defined processes and trained personnel for seamless recovery. |
Disaster scenarios and their potential risks
Before discussing the key factors that can help you formulate a DR program, let's start by understanding some real-life scenarios that can interrupt overcloud services.
Service failure
Most of the time, services stop working due to software issues, operating system bugs, or failed OpenStack upgrades. This could affect any OpenStack overcloud service, such as Cinder, Nova, Keystone, or the database.
The impact on the instances can vary depending on the service. For example, if there are any ongoing instance deployments, the failure of any overcloud service can lead to deployment failures.
Scenario highlighting a single OpenStack overcloud service failure
Controller failure
Hardware failures are very common in data centers and can sometimes cause the complete outage of a system. Controller nodes, whether virtual or physical, can go out of service in these situations.
Depending on the design, a controller node failure may not impact running instances and their data plane traffic, but it can affect administrative operations performed through the OpenStack APIs. Also, if the database hosted on the failed controller is not replicated or backed up, instance or service information may be permanently lost.
Scenario highlighting the failure of a controller node
Compute node failure
Compute node failure is the most common problem in OpenStack clouds, and disk issues and other types of hardware failure are its most frequent causes.
Illustration of a compute node failure
The apparent risk involved in this scenario is the loss of instances and their disk data if they are using local storage.
Network failure
Bad small form-factor pluggable (SFP) connectors, faulty cables, network interface card (NIC) issues, and switch failures are all possible reasons for losing network connectivity on the data or control plane.
Scenario highlighting a network failure
Any network failure involving data plane NICs directly impacts the instances using those NICs. A control plane network failure, by contrast, is mostly felt when there are pending tasks such as a reboot, migration, or evacuation.
Instance failure
OpenStack instances, whether standalone or part of an application cluster, are always subject to failure. Human error, host disk failure (or other hardware failure), power outages, and other issues can cause problems such as data loss, VM downtime, and instance deletion.
Instance failures are less catastrophic than losing an entire cloud node, but they happen for a wider variety of reasons. Recovery often requires redeploying the instance or, in some cases, the entire stack.
Scenario highlighting a single VM instance failure
Controller redundancy
OpenStack's most basic design consideration is having a cluster of multiple controllers, usually an odd number of three or more (3, 5, 7, etc.). Typically, a minimum of three controllers is deployed to maintain quorum and ensure system consistency in the event of a single server failure. Quorum ensures that the cluster can continue to operate and make consistent decisions even if one or more nodes fail. In a cluster with an even number of nodes, a split-brain scenario could occur in which two halves of the cluster both believe they are in charge, leading to inconsistencies and potential data corruption.
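To make the quorum arithmetic concrete, here is a small illustration (plain Python, no OpenStack dependencies) of why an even-sized cluster adds cost without adding fault tolerance:

```python
# Quorum arithmetic for a controller cluster: with n nodes, quorum is
# floor(n/2) + 1 votes, so the cluster tolerates n - quorum node failures.
def tolerated_failures(n: int) -> int:
    quorum = n // 2 + 1
    return n - quorum

for n in (2, 3, 4, 5):
    print(f"{n} controllers -> survives {tolerated_failures(n)} failure(s)")

# Output: 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2. A fourth node adds hardware cost
# but no extra fault tolerance, which is why odd-sized clusters are the norm.
```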
There are different standard practices for managing controller redundancy:
- Three (or more) bare-metal servers running containers, with each container hosting one service. For example, one container might run the nova-scheduler service, another the keystone service, and so on. This approach offers isolation between services, potentially enhancing security and simplifying troubleshooting.
- All control plane services are hosted together on each of the three (or more) servers. This replication of services across multiple servers simplifies deployment and management, as each server can be treated as a self-contained unit. In the event of a server failure, the remaining servers in the cluster continue to provide the necessary services, ensuring minimal disruption. These control plane servers can also be virtualized, which can provide additional management benefits such as fault tolerance, ease of backup, and ease of scaling.
- Using Kubernetes to manage the OpenStack control plane services as containerized workloads. This approach enables easier scaling of individual services based on demand while offering self-healing mechanisms that automatically recover from failures.
For most OpenStack web services, a load balancer like HAProxy or NGINX is the primary tool for ensuring high availability and distributing traffic among controller nodes. In addition, a cluster resource manager such as Pacemaker can manage controller redundancy, offering features such as service migration on failure and control plane redundancy for high availability.
Pacemaker can be used in conjunction with a load balancer to manage the endpoint IP, ensuring that it always points to an active and healthy controller node. However, services like the database and message queue require additional measures to ensure high availability: for example, a Galera cluster can provide multi-read/write database capability, while the message queue can be configured in a distributed mode for redundancy.
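As an illustration of the kind of health check this implies, here is a minimal sketch that probes a Galera cluster's standard wsrep status variables; it assumes the PyMySQL driver, and the host and credentials are placeholders:

```python
# Minimal Galera health probe. wsrep_* are standard Galera status variables.
import pymysql

conn = pymysql.connect(host="controller-vip", user="monitor",
                       password="secret", database="mysql")
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'")
    _, size = cur.fetchone()
    cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'")
    _, status = cur.fetchone()
conn.close()

# A healthy three-node cluster reports size 3 and status 'Primary';
# anything else should page an operator before it becomes a DR event.
if status != "Primary" or int(size) < 3:
    raise RuntimeError(f"Galera degraded: status={status}, size={size}")
```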
Hosting controller services through containers and VMs
As a recommended practice, choose containers or VMs over bare metal to host controller services: it is much easier to take backups and snapshots and to redeploy in case of significant hardware failure.
Usually, the ideal approach for minimizing resource consumption by the OpenStack control plane is to distribute the services equally across all controllers. However, mission-critical services like databases require special consideration. Besides backing up databases at regular intervals, maintain multiple synchronized database instances, with one designated as the primary and the others as replicas. It is also important to ensure that the database backup is always transferred to multiple locations outside the cluster for emergencies. One idea is to include the OpenStack config files and other critical information that ensures a controller node's functionality in the same backup file as the database, as sketched below.
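A minimal sketch of that backup flow, using standard mysqldump, tar, and scp; all paths, host names, and credentials are illustrative:

```python
# Dump the database, bundle it with the OpenStack config files, and push
# the archive off-cluster to at least one (preferably two) remote hosts.
import subprocess
import time

stamp = time.strftime("%Y%m%d-%H%M%S")
dump = f"/var/backups/openstack-db-{stamp}.sql"
archive = f"/var/backups/controller-{stamp}.tar.gz"

# Galera-friendly logical dump of all databases.
with open(dump, "w") as f:
    subprocess.run(["mysqldump", "--all-databases", "--single-transaction"],
                   stdout=f, check=True)

# One archive holding the dump plus the service configuration directories.
subprocess.run(["tar", "czf", archive, dump,
                "/etc/keystone", "/etc/nova", "/etc/neutron", "/etc/cinder"],
               check=True)

# Copy outside the cluster for the emergency case described above.
for target in ("backup1.example.com", "backup2.example.com"):
    subprocess.run(["scp", archive, f"backup@{target}:/srv/openstack/"],
                   check=True)
```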
Below is an example of service distribution across controller nodes. A service listed under a controller is active on that node and handles incoming requests; services not listed are not necessarily absent from that node.
Controller node service distribution
A distributed service design ensures that no single node is overloaded to the point of affecting performance. Alternatively, for simplicity while troubleshooting, you can pin specific services to one controller so that it is easier to spot a failure; for example, the Nova services (API, scheduler, and conductor) can all run on one node.
One last item to mention is how to redeploy the controllers after a disaster. Whatever tool you use to deploy the operating systems on your nodes (e.g., a lifecycle management (LCM) tool), always keep a backup of the base image used to deploy the controllers; this image usually includes the packages and libraries needed for the controllers' basic functionality. Taking periodic snapshots of the controllers and transferring them to safe locations is a great way to recover from a disaster without losing too much information or spending too much time restoring a backup.
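If the controllers run as VMs on an underlying cloud, snapshots like these can be scripted with openstacksdk. This is a sketch under assumed naming conventions (a clouds.yaml entry called "undercloud" and controllers named controller-*):

```python
# Take a dated snapshot image of every controller VM.
import time
import openstack

conn = openstack.connect(cloud="undercloud")
stamp = time.strftime("%Y%m%d")

for server in conn.compute.servers():
    if server.name.startswith("controller-"):
        # Creates a Glance image from the running controller VM.
        conn.compute.create_server_image(server,
                                         name=f"{server.name}-snap-{stamp}")
```

The resulting Glance images can then be downloaded and shipped off-cluster like any other backup artifact.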
Compute node redundancy
Depending on the capacity of the overcloud compute nodes and the criticality of the instances running on them, there should always be room for at least one compute node failure so that all instances on the failed node can be evacuated to a spare node. If multiple compute node groups have different capabilities, such as CPU architectures, single root I/O virtualization (SR-IOV), or the Data Plane Development Kit (DPDK), this design must be more granular to account for the individual groups.
The best practice is to subdivide your nodes into multiple host aggregates and then assign one or more spare compute nodes with the same capabilities and resources to each aggregate. Needless to say, these spare nodes must remain free of workload so that they can host the instances of a failed compute node. In addition, availability zones (AZs) can be mapped to these host aggregates, allowing users to select where their instances are deployed based on their requirements. If a compute node fails within an AZ, the instances can be seamlessly evacuated to the spare node(s) within the same AZ, minimizing disruption and maintaining service continuity.
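As a sketch of that layout, the aggregate and AZ wiring might look like the following with openstacksdk; the cloud entry, aggregate name, and host names are assumptions:

```python
# Group hosts with identical capabilities into an aggregate exposed as an AZ,
# and include one idle spare alongside the production hosts.
import openstack

conn = openstack.connect(cloud="mycloud")

# One aggregate per capability group, exposed to users as availability zone az1.
agg = conn.compute.create_aggregate(name="dpdk-az1", availability_zone="az1")

# Production hosts plus one spare with the same capabilities and resources.
for host in ("compute-01", "compute-02", "compute-spare-01"):
    conn.compute.add_host_to_aggregate(agg, host)
```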
The fencing mechanism and instance high-availability (HA) policy options are optional features that can mitigate the impact of compute node failures, specifically for mission-critical services that cannot tolerate any downtime. By defining specific HA policies for the instances, you determine what happens to them if the underlying host goes down. When the downtime of an instance cannot be tolerated, the applicable HA policy is ha-offline, which evacuates the instance to another compute node (the spare node). Note that the fencing agent must be enabled in Nova for this scenario to work; see the OpenStack high-availability documentation for more information on HA for compute instances.
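For illustration, the sketch below shows what such an evacuation amounts to at the API level, issuing Nova's evacuate server action against a failed (and already fenced) host; instance-HA tooling automates this same step. The host names and cloud entry are assumptions, and the raw-action call is one of several ways to drive this:

```python
# Evacuate every instance off a failed compute node to a designated spare.
import openstack

conn = openstack.connect(cloud="mycloud")
failed, spare = "compute-02", "compute-spare-01"

# Admin-only host filter: list the servers stranded on the dead hypervisor.
for server in conn.compute.servers(all_projects=True, host=failed):
    # Nova's "evacuate" action; microversion 2.14+ drops the legacy
    # onSharedStorage flag and lets Nova handle storage placement.
    conn.compute.post(f"/servers/{server.id}/action",
                      json={"evacuate": {"host": spare}},
                      microversion="2.14")
```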
Another important consideration is handling OpenStack instances with local disks. Based on the probable incidents for compute nodes, there are usually two types of failures that can affect VMs that are not using Cinder or separate block storage:
- Disk failure on a compute node: In this case, a good practice is to have RAID arrays that can tolerate the failure of a single disk.
- Compute node failure: To mitigate this, it's recommended that you keep a backup of the disk image (usually a qcow2 file) for instances with sensitive data, as sketched below.
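Here is a minimal sketch of that qcow2 backup, assuming the default libvirt/Nova instance layout; the instance UUID and backup destination are hypothetical:

```python
# Copy an instance's local boot disk to a backup location.
import subprocess

uuid = "0f4c6d8a-0000-0000-0000-000000000000"   # hypothetical instance UUID
src = f"/var/lib/nova/instances/{uuid}/disk"
dst = f"/backup/{uuid}-disk.qcow2"

# 'convert' flattens the qcow2 backing-file chain into one standalone image,
# so the copy restores cleanly on any other compute node. For a consistent
# image, pause or shut down the instance first.
subprocess.run(["qemu-img", "convert", "-O", "qcow2", src, dst], check=True)
```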
Please also remember that for disaster recovery purposes, it is much safer and more efficient for VMs hosting critical services to use either Cinder volumes or Ceph accessed directly via RBD (bypassing Cinder). Both approaches ensure that even if a compute node's local storage fails, the instance's data remains intact and can easily be attached to another compute node, minimizing downtime and potential data loss. Using Ceph directly via RBD offers the added benefit of skipping the Cinder layer for potentially better performance, and Ceph's distributed nature and self-healing capabilities offer additional resilience against hardware failures. Additionally, because the instance's metadata is stored separately, it can be used to launch the failed instance on another compute node, ensuring a seamless recovery.
Network redundancy
OpenStack networking requires different deployment strategies depending on how busy the cloud is and which services are used the most. For big production sites, the best practice is to have separate redundant NICs, each dedicated to a specific job: control plane traffic, data plane traffic, storage traffic, and out-of-band management (e.g., Integrated Lights-Out, or iLO).
These NICs usually consist of two interfaces working in active-standby mode inside a Linux bond, with each interface connected to a separate leaf/access switch for more redundancy. For SR-IOV-capable NICs used for data plane traffic, a recommended practice is to attach two virtual functions (VFs) to each instance, one from each port of the NIC, to guarantee interface redundancy if one goes down.
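A small sketch of how such a bond can be monitored from the host side, reading the kernel's bonding status file; the bond name is illustrative:

```python
# Check an active-standby Linux bond for degraded links via procfs.
from pathlib import Path

status = Path("/proc/net/bonding/bond0").read_text()

# In active-backup mode, the bond and each slave report "MII Status: up/down".
down = [line for line in status.splitlines()
        if line.startswith("MII Status:") and "down" in line]

if down:
    # One NIC or uplink switch has failed; traffic runs on the standby path,
    # but redundancy is gone until the fault is repaired.
    print(f"bond0 degraded: {len(down)} link(s) down")
```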
Other than redundancy for physical networking elements, it’s highly recommended to have backups of the switches’ images and configuration. These backups can be made by your lifecycle management (LCM) tool or any other native or external tools, and they can be used to immediately spin up a switch in case it fails and needs to be replaced.
The networking equipment’s redundancy and design strategy are beyond this article’s scope.
Controller services backup
When managing an OpenStack cloud, it's important to plan for failure or redeployment of the controller nodes, which host the essential services that manage the entire cloud infrastructure. To minimize downtime and ensure a smooth recovery, back up both controller node configuration and state. The configuration backup should cover the configuration files of key controller services, such as Keystone, Nova, and Neutron, along with the settings, parameters, and customizations that define how each service operates, including database schemas and any scripts or modifications implemented.
As for preserving state, the ideal approach is to back up all of the dynamic information that services accumulate during operation, such as database entries, message queues, and cached data. For instance, a database snapshot can capture the state of a database at a specific point in time, while exporting message queue state helps ensure that pending tasks are not lost.
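For the message queue piece, here is a sketch using RabbitMQ's definitions export (available in recent RabbitMQ releases); note that it captures the broker topology (vhosts, queues, exchanges, users) rather than in-flight messages, which instead rely on durable queues and replication:

```python
# Export RabbitMQ definitions alongside the database dump.
import subprocess
import time

stamp = time.strftime("%Y%m%d-%H%M%S")
subprocess.run(["rabbitmqctl", "export_definitions",
                f"/var/backups/rabbitmq-definitions-{stamp}.json"],
               check=True)
```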
While some OpenStack distributions or management tools offer automated recovery options, relying solely on these mechanisms can be risky. These tools might not always capture the full breadth of configuration and state data, potentially leading to incomplete recovery or requiring manual intervention. Instead, having a comprehensive backup strategy that encompasses both configuration and state backups provides a more robust and reliable means of restoring controller services to their precise pre-failure state.
Instance recovery
When a VM or its data is lost to events like accidental deletion, hardware failure, or software corruption, how quickly you can restore operations and minimize downtime depends on how fast the instance can be recovered. Without adequate backups of the database or the VMs' boot disks, OpenStack may be unable to power on the affected VMs, hindering the recovery process almost entirely.
To recover from such disasters, administrators should consider using Heat Orchestration Templates (HOT files) for the initial deployment of the VMs. A best practice is to group VMs that are part of the same application or service and create a dedicated template and environment file. This file should encompass all relevant VM characteristics, including flavors, networks, images, disks, availability zones, metadata, and more.
The benefits of utilizing an orchestration service like Heat for VM deployments are significant. Whether one VM or all of them are affected, Heat’s features enable recovery, redeployment, and updates of instances. In cases where VMs are deleted, Heat can restore them to their initial deployment state. However, it’s important to note that any changes or data added after the initial deployment will be lost. Therefore, maintaining regular backups of critical VMs or periodic snapshots of Cinder volumes is crucial for comprehensive data protection.
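To make this concrete, here is a minimal sketch of the flow using openstacksdk's cloud layer; the template contents, file names, and resource names (network, image, flavor) are all assumptions:

```python
# Deploy an application group from a HOT file; re-running the same stack
# after a disaster recreates the VMs in their day-one state.
import openstack

HOT = """\
heat_template_version: 2018-08-31
description: One application VM and its port, grouped for easy recovery
resources:
  app_port:
    type: OS::Neutron::Port
    properties:
      network: app-net                 # assumed pre-existing network
  app_server:
    type: OS::Nova::Server
    properties:
      name: app-vm-01
      image: ubuntu-22.04              # assumed Glance image
      flavor: m1.medium
      availability_zone: az1
      networks:
        - port: { get_resource: app_port }
"""

# In practice, the template lives in version control, not an inline string.
with open("app-stack.yaml", "w") as f:
    f.write(HOT)

conn = openstack.connect(cloud="mycloud")
conn.create_stack("app-stack", template_file="app-stack.yaml", wait=True)
```

Updating or re-running the same stack after a failure restores the missing resources to their initial deployment state, which is exactly the recovery property described above.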
Modular HOT files for efficient deployment and management
Another effective strategy involves maintaining separate HOT files for different components of a specific application or service. For instance, network configurations (ports, IPs, MAC addresses, subnets, and VLAN/VXLAN networks) can be stored in a separate HOT file from the one containing VMs and other resources. This modular approach allows for isolated modifications to networking without impacting VMs, eliminating the need for full redeployment of an overcloud application.
Many OpenStack vendors offer their own orchestration tools that interact with Heat in the background. These tools often provide additional features and a graphical user interface (GUI) for simplified management.
Alternative backup strategies for overcloud instances
In scenarios where cloud users don’t require multi-node deployments, using HOT files might not be the most efficient approach. Instead, direct backups of instance images stored locally on the compute host or in a storage area can be more suitable.
However, due to access restrictions and potential unfamiliarity with virtualization engines, manual backups might not be practical for all users. It is highly recommended to use OpenStack-native tools such as snapshots, or Trilio, which is more user-friendly and offers more features for overcloud instance backups. The advantage of Trilio in particular is that it is native to OpenStack and easily accessed from the admin panels, eliminating the need for external integration or connectivity.
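As a sketch of the native-snapshot route, a per-volume snapshot can be taken with openstacksdk; the volume name and cloud entry are illustrative:

```python
# Snapshot a single instance's data volume with the native Cinder API.
import openstack

conn = openstack.connect(cloud="mycloud")
volume = conn.block_storage.find_volume("app-data-01")

# force=True permits snapshotting a volume that is attached and in use;
# quiesce the application first if you need a consistent snapshot.
conn.block_storage.create_snapshot(volume_id=volume.id,
                                   name=f"{volume.name}-snap",
                                   force=True)
```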
Backup considerations for controllers, storage nodes, and undercloud
Using tools like Trilio is also recommended for controllers, storage nodes, undercloud nodes, and other infrastructure components. Even though most OpenStack deployments use LCM tools for node deployment, these tools usually provide only the base image and its initial configuration. In the event of a disaster, such as a storage failure, an LCM can restore the affected node only to its initial state, potentially losing any additional configuration or data.
Trilio can be particularly valuable for TripleO deployments, where all OpenStack nodes (overcloud and undercloud) are deployed on top of OpenStack. Periodic backups of controllers, storage nodes, etc., especially before or after major changes or upgrades, become easily manageable. Restoring these backups is also simplified, as it doesn’t require interaction with external LCM applications.
Creating a disaster recovery program
A disaster recovery plan’s people and process aspects are as important as the redundancy architecture and backup methodology. Consider the following factors to help formalize a disaster recovery plan:
- Business impact analysis (BIA): Perform a BIA to determine the critical applications and data essential for your organization’s operations. Identify the maximum acceptable downtime and data loss for each critical component and the resources required for recovery.
- Recovery objectives: Define clear recovery objectives, such as the recovery time objective (RTO) and recovery point objective (RPO), based on the BIA and your organization’s business requirements. The RTO specifies the maximum time to recover critical systems, while the RPO determines the maximum acceptable data loss.
- Documentation: Maintain up-to-date documentation for your disaster recovery plan, including detailed recovery procedures, contact information for key personnel, vendor relationships, and off-site storage locations.
- Testing and validation: Regularly test and validate your disaster recovery plan to ensure its effectiveness and to identify gaps or issues. This should include testing the backup and recovery processes and simulating various disaster scenarios to validate the recovery strategies.
- Training and awareness: Provide training and awareness programs for all relevant operations and engineering personnel, ensuring that they understand their roles and responsibilities during a disaster. Regular drills and exercises can help reinforce disaster recovery procedures.
Conclusion
Disasters are an unfortunate reality, but designing a robust and redundant OpenStack architecture can mitigate their impact, whether the failure involves a controller node, compute node, storage node, interface/cable, hard disk drive, or something else.
This article reviewed the core concepts and best practices for an efficient OpenStack backup strategy. We also discussed how OpenStack instances can take advantage of the HA-policy option based on the workload's sensitivity. The HA policy can be combined with different agents to perform a specific task on the instance or its underlying hypervisor; for example, instances can be evacuated to a spare node, and the faulty compute node can be fenced so that the scheduler does not select it when spinning up new instances.
For comprehensive backup and recovery of overcloud instances, controllers, storage nodes, and the undercloud, leveraging native OpenStack tools like Trilio is highly recommended. Trilio simplifies the backup process, offers additional features, and seamlessly integrates with OpenStack’s admin panels, eliminating the need for external connectivity or complex configurations.