When you think big data, you think Hadoop, Cassandra, MongoDB, and other scale-out NoSQL databases. All these applications share an important trait: they are scale-out applications, built to scale from a single node to hundreds of nodes. They dynamically distribute data between nodes, and they replicate data between nodes to improve read performance and guard against hardware failures.
So the prevailing thought is that these applications do not require backups, since their built-in redundancy protects against hardware failure. This is partly true, but other use cases still require point-in-time copies of these applications:
- Inadvertent changes, data loss, or data corruption. When this happens, the loss or corruption is replicated to all copies of the data, resulting in permanent data loss.
- Test and development environments that need a copy of production. In industries such as financial services, different teams run different hypotheses against big data, so they need access to a point-in-time copy of production data.
But traditional backup solutions, and even storage-based LUN snapshots, do not serve these architectures well, for several reasons:
- These architectures distribute data across multiple nodes of the cluster, so a consistent view of the data is likewise spread across multiple nodes. As the cluster grows, the complexity of achieving a consistent backup grows with it. Taking a snapshot of a single node or a single LUN at a time therefore does not provide a good backup solution.
- These architectures are also dynamic. When a new node is added or an existing node is removed, these applications rebalance data between nodes to achieve optimal performance. It is therefore important to stop the application from performing rebalancing tasks before taking a cluster snapshot.
- Incremental backups
- Big data deployments tend to be large; unless native tools can capture incremental changes, performing a full backup each time is very expensive.
- The support for incremental changes in application-specific tools varies widely, and developing a common solution across these applications can be challenging.
- Object-level restore
- Since data records are distributed across nodes, object-level restores require restoring the entire cluster as it existed at the time of the backup. This means the cluster configuration must be backed up along with the cluster data.
Applications such as Hadoop, MongoDB, and Cassandra provide either tools or APIs to support backup functionality. We will review each of these tools, their capabilities, and their shortcomings.
The Complexity of Scale-Out Application Architectures
Applications that distribute data across nodes, known as scale-out applications, have revolutionized scalability and performance. This architectural approach is commonly seen in big data analytics, content delivery networks, and distributed databases, and it provides a level of redundancy and fault tolerance. However, as these applications become more intricate and more significant, it is crucial to prioritize the protection of their data against loss, corruption, or unforeseen events.
To effectively address backup and recovery challenges in scale-out environments, it is essential to understand the intricacies of data distribution, replication, and sharding. The redundancy and fault tolerance of scale-out architectures present unique considerations for data protection that demand tailored strategies and solutions.
Scale-out architectures distribute data across nodes to ensure availability and fault tolerance. However, this characteristic introduces challenges for backup and recovery processes. Data fragmentation, distribution algorithms, and guaranteeing protection for each node’s data all pose obstacles. Moreover, a clear understanding of the distributed architecture plays a key role in developing efficient backup strategies that cover all aspects of data protection.
Challenges with Data Consistency and Coherency
Ensuring data consistency in a distributed environment can be quite a challenge, especially for applications that scale out. When data is stored and accessed across nodes, it becomes crucial to maintain coherence among all copies of the data. This becomes even more critical during backup and recovery operations.
One of the central challenges is backing up and recovering distributed data without compromising data integrity. Addressing it requires robust synchronization techniques and a deep understanding of distributed data management.
Maintaining both data consistency and coherency is a core concern in scale-out applications. Such applications often rely on distributed databases or file systems, making it vital that data remains synchronized and coherent across nodes during backup and recovery. Mechanisms such as versioning, conflict resolution, or distributed locking become essential parts of any backup and recovery strategy.
Challenges in Scaling
As scale-out applications grow in size and complexity, managing data growth becomes a pressing concern. The challenges of backing up and recovering this much data are not just technical but operational in nature. Striking the right balance between performance, scalability, and data protection is a delicate task.
Furthermore, organizations must ensure that their backup infrastructure grows in proportion to their scale-out applications. This requires careful planning and resource allocation. The challenges of scalability extend to hardware, software, and human resources, making it a complex issue that needs sustained attention.
Meeting the demands of ever-increasing data volumes is a daunting task. Adapting backup and recovery processes to growing data can strain resources, both in hardware and in human expertise. Decisions about storage capacity, network bandwidth, and processing power must align with the growth rate of scale-out applications. Striking a balance between scalability and operational efficiency therefore remains an ongoing challenge.
Backup Best Practices for Scale-Out Applications
Creating a backup strategy for scale-out applications is crucial to guarantee the availability and recoverability of data. These applications come with their own set of challenges; the following best practices help you overcome them:
- Scheduled Automated Backups: Set up automated schedules that regularly capture snapshots of your data. This reduces the risk of data loss and ensures you have backups ready for recovery if needed.
- Distributed Backup Catalog: Maintain a distributed catalog that keeps track of backup locations and their contents. This is essential for data retrieval during the restoration process.
- Offsite Backup Storage: Consider storing backups offsite to protect against site-wide failures. Cloud storage or geographically dispersed data centers can be invaluable for disaster recovery.
- Monitoring and Alert Systems: Implement monitoring and alerting that keeps watch on the health of your backups, so you are notified of failures or issues and can take corrective action.
- Test Restores: Perform test restores on a regular basis to verify the integrity of your backup data. Also make sure the restoration process is well documented and can be executed smoothly in a data loss scenario.
- Data Encryption: Encrypt your backups to safeguard sensitive information. Ensure secure management of the encryption keys.
- Stay on top of updates: Keep your backup solutions and tools up to date. Updates often include bug fixes and enhancements that improve the reliability and effectiveness of your backups.
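Several of these practices, scheduled snapshots, a backup catalog, and restore verification, can be combined in a small orchestration script. The Python sketch below is a minimal illustration only; the timestamped directory layout and the SHA-256-based `catalog.json` format are assumptions made for this example, not part of any particular backup product:

```python
import hashlib
import json
import shutil
import time
from pathlib import Path

def take_backup(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a timestamped backup directory and record a catalog entry."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = backup_root / stamp
    shutil.copytree(data_dir, dest)
    # Record each file's checksum so restores can be verified later.
    entries = {
        str(p.relative_to(dest)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in dest.rglob("*") if p.is_file()
    }
    catalog = backup_root / "catalog.json"
    existing = json.loads(catalog.read_text()) if catalog.exists() else {}
    existing[stamp] = entries
    catalog.write_text(json.dumps(existing, indent=2))
    return dest

def verify_backup(backup_dir: Path, backup_root: Path) -> bool:
    """Re-hash the backed-up files and compare against the catalog entry (a test restore)."""
    catalog = json.loads((backup_root / "catalog.json").read_text())
    entries = catalog[backup_dir.name]
    return all(
        hashlib.sha256((backup_dir / rel).read_bytes()).hexdigest() == digest
        for rel, digest in entries.items()
    )
```

In practice the catalog itself should be replicated offsite along with the backups, since a catalog lost with the primary site is of no use during disaster recovery.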
Emerging Technologies for Backup and Recovery in Scale-Out Environments
As scale-out applications continue to evolve, the technologies for backup and recovery are advancing as well. It is important to stay up to date on emerging technologies that tackle the backup and recovery challenges in these environments. Here are some key areas to explore:
- Distributed Backup Solutions: Look into solutions specifically tailored for scale-out architectures. These solutions offer support for managing distributed data and handling snapshots efficiently.
- Containerization and Orchestration: Embrace containerization and orchestration technologies to streamline the management of backup and recovery processes in containerized environments.
- Machine Learning and AI: Investigate how machine learning and artificial intelligence can be leveraged to anticipate and prevent data loss incidents, improving backup strategies and response times.
- Immutable Backups: Explore solutions that safeguard backups against unauthorized tampering or deletion. Immutable backups play a key role in maintaining data integrity.
- Cloud-Native Backup Services: Consider cloud-native backup services, such as those offered by leading cloud providers or specialized solutions like Trilio. These services are designed to integrate tightly with cloud-based applications.
MongoDB
The standard MongoDB backup tool is mongodump, which can take a dump of all data files on a node. It works well for a single node. For a multi-node configuration, backup is a multi-step process:
- Stop config server(s)
- Take a lock on a replica on each node
- Perform mongodump on each node
- Copy data files to the backup server
- Unlock the replica on each node
- Start config servers
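The steps above can be sketched as an ordered backup plan. The Python sketch below only builds the command lines rather than executing them; the hostnames and backup directory are hypothetical, locking the config servers with `db.fsyncLock()` stands in for stopping them (a simplification), and copying the dump output to a backup server is omitted:

```python
from typing import List

SHARD_HOSTS = ["shard1.example.com", "shard2.example.com"]  # hypothetical hosts
CONFIG_HOSTS = ["cfg1.example.com"]                         # hypothetical host
BACKUP_DIR = "/backups/mongo"                               # hypothetical path

def lock_cmd(host: str) -> List[str]:
    # db.fsyncLock() flushes pending writes and blocks new writes on the member.
    return ["mongo", "--host", host, "--eval", "db.fsyncLock()"]

def dump_cmd(host: str) -> List[str]:
    # mongodump writes BSON dumps of all databases on the member to --out.
    return ["mongodump", "--host", host, "--out", f"{BACKUP_DIR}/{host}"]

def unlock_cmd(host: str) -> List[str]:
    return ["mongo", "--host", host, "--eval", "db.fsyncUnlock()"]

def backup_plan() -> List[List[str]]:
    """Order matters: quiesce config servers first, lock/dump/unlock each
    shard member, and release the config servers last."""
    plan = []
    for host in CONFIG_HOSTS:
        plan.append(lock_cmd(host))
    for host in SHARD_HOSTS:
        plan.append(lock_cmd(host))
        plan.append(dump_cmd(host))
        plan.append(unlock_cmd(host))
    for host in CONFIG_HOSTS:
        plan.append(unlock_cmd(host))
    return plan
```

Each command list could then be handed to `subprocess.run` in sequence; keeping the plan as data makes the ordering easy to audit and test before it ever touches a production cluster.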
Apart from the complexity of the backup process, the mongodump tool has further limitations:
- Mongodump does not support incremental backups. MongoDB suggests using file-system-level snapshot functionality for incremental backups, so your ability to back up MongoDB depends on the capabilities of the underlying file system.
- Mongodump-based backups require taking a lock on each replica, which means the replica won’t receive any new updates until it is unlocked. MongoDB can only tolerate an oplog’s worth of changes between replicas. If the locked replica falls behind by more than the oplog window, MongoDB performs a complete resync of the replica when it is unlocked. So unless your backup process is efficient, every backup may end up resynchronizing an entire replica.
- Mongodump only supports full restores. That means the current production cluster must be clobbered before the data can be restored from your backup media, and the size of the cluster being restored must match the size of the cluster when the backup was taken.
- Object-level restore can be very tricky and resource intensive. To restore a subset of objects from the backup media, you need to recreate the identical cluster that existed at the time of the backup, restore the backup data to that cluster, and then retrieve the objects you need.
Cassandra
Cassandra is a highly scalable, peer-to-peer database system. Data is replicated among multiple nodes across multiple data centers. Single or even multi-node failures can be recovered from the surviving nodes, as long as the data exists on those nodes.
Cassandra saves data as SSTables, where each SSTable is a file that contains key/value pairs. Cassandra creates numerous SSTables over time. Though SSTables are immutable, Cassandra periodically compacts and merges these files to consolidate data and free up storage.
Cassandra uses Linux hard links for creating snapshots. During a snapshot operation, Cassandra creates a hard link for each SSTable in a designated snapshot directory. As long as the SSTables are not changed, the snapshot does not consume any additional storage. However, during a compaction or merge operation, the file system keeps a copy of the old SSTables, which requires an equal amount of free space on the system. Cassandra supports incremental snapshots by creating hard links to SSTables created since the last full snapshot operation.
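The hard-link mechanics can be demonstrated with ordinary files. This simplified Python simulation (the file names are illustrative, not real Cassandra paths) shows why a snapshot is initially free and only starts consuming space once compaction removes the original SSTable:

```python
import os
import tempfile

data_dir = tempfile.mkdtemp()
snap_dir = os.path.join(data_dir, "snapshots")
os.mkdir(snap_dir)

# A stand-in "SSTable": immutable once written.
sstable = os.path.join(data_dir, "table-1-Data.db")
with open(sstable, "w") as f:
    f.write("key1:value1\n")

# Snapshot: a hard link pointing at the same inode -- no data is copied.
snapshot = os.path.join(snap_dir, "table-1-Data.db")
os.link(sstable, snapshot)
assert os.stat(sstable).st_ino == os.stat(snapshot).st_ino  # same blocks on disk
assert os.stat(sstable).st_nlink == 2                       # two names, one file

# "Compaction": Cassandra writes consolidated SSTables and deletes the old one.
os.remove(sstable)
# The snapshot link still holds the old data -- it now occupies its own storage.
with open(snapshot) as f:
    assert f.read() == "key1:value1\n"
```

This is also why snapshot directories must be copied off the node and then cleared: leftover snapshot links silently pin old SSTables and eat disk space as compaction proceeds.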
Though Cassandra supports incremental snapshots, its native tooling suffers from the same kinds of limitations as MongoDB’s:
- Though Cassandra supports incremental backups, it retains only one set of them, covering changes made since the last snapshot. To leverage the feature, your backup process must actively copy the incremental backups elsewhere and then delete them so that Cassandra can create a new set. Because incremental backups accumulate changes since the last snapshot, they grow over time, forcing your backup process to take a full snapshot periodically to keep their size manageable.
- Backup scripts usually compress and tar SSTables before uploading them to non-local storage. These operations use system resources aggressively, putting additional load on the cluster.
- Cassandra does not provide a restore utility, though the restore procedure is well documented. A restore can be performed per keyspace, which avoids a complete restore of the cluster. However, the process is quite involved:
- Shutdown Cassandra
- Clear all files in the commit log directory, assuming that cluster shutdown flushes commit logs.
- Remove all contents of the active keyspace
- Copy the contents of the desired snapshot to the active keyspace
- If the incremental backups feature is enabled, copy the contents of the full snapshot, then copy the contents of the latest incremental backup on top of the restored snapshot.
- Repeat the keyspace clearing and copying steps on all nodes.
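The per-node restore steps above can be sketched as follows. This is a minimal Python simulation under strong assumptions: flat directories, Cassandra already shut down cleanly, and hypothetical paths; a real restore would also restart Cassandra or run nodetool refresh afterwards so the node picks up the restored SSTables:

```python
import shutil
from pathlib import Path
from typing import Optional

def restore_keyspace(keyspace_dir: Path, snapshot_dir: Path,
                     incremental_dir: Optional[Path],
                     commitlog_dir: Path) -> None:
    """Restore one keyspace from a snapshot; assumes Cassandra is stopped."""
    # Clear commit logs -- assumes the clean shutdown already flushed them.
    for f in commitlog_dir.iterdir():
        f.unlink()
    # Remove all contents of the active keyspace.
    for f in keyspace_dir.iterdir():
        f.unlink()
    # Copy the full snapshot into the active keyspace directory.
    for f in snapshot_dir.iterdir():
        shutil.copy2(f, keyspace_dir / f.name)
    # Overlay the latest incremental backup, if the feature is enabled.
    if incremental_dir is not None:
        for f in incremental_dir.iterdir():
            shutil.copy2(f, keyspace_dir / f.name)
```

Because incremental SSTables are simply newer files, overlaying them after the full snapshot reproduces the keyspace as of the last incremental backup rather than the last full snapshot.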
To sum up, this article delves into the complexities and obstacles of backup and recovery in scale-out applications, focusing on NoSQL databases like MongoDB and Cassandra. These applications are designed for scalability, data distribution, and replication, which makes them resilient against hardware failures. However, there are still scenarios where point-in-time copies become necessary.
The article outlines the shortcomings of traditional backup solutions for these architectures, including issues with data consistency, the dynamic nature of these databases, and the lack of support for incremental backups and object-level restores. It also emphasizes best practices such as scheduled automated backups, distributed catalogs, offsite storage, monitoring, regular test restores, data encryption, and staying current with updates.
Additionally, this piece introduces emerging technologies that can address the backup and recovery challenges in scale-out environments: distributed backup solutions; containerization and orchestration; machine learning; immutable backups; and cloud-native backup services.
Furthermore, the article touches on the considerations and limitations of backing up MongoDB and Cassandra databases. It underscores the complexity of these operations and the need for solutions that cater to these databases’ unique requirements.
At Trilio, we are hard at work solving these problems. To learn more, please contact us.