Data deduplication is a technique that improves storage utilization by identifying unique data and storing only one instance of it. Every other occurrence of that data is replaced with a small reference to the single stored instance of the corresponding data block.
Storage-based data deduplication is most effective in applications where many copies of very similar or even identical data are stored on a single system, which is a surprisingly common scenario. Data backups are one of the use cases where deduplication is widely used, since most data in a given backup remains unchanged from previous backups. Another use case is WAN optimization, where only unique data, or references to unique data, is sent to remote locations to synchronize copies of data sets between sites. Yet another use case is a large deployment of virtual servers or VMs that share the same VM image. However, this last case can easily be handled with QCOW2 images that use a base VM image as the backing file.
There are some good discussions on the web about various data deduplication technologies; here, however, we focus mainly on inline deduplication in file systems.
There are primarily two ways to deduplicate data in a file system:
- Variable-size blocks – This approach is well suited for file-level deduplication. Variable-size deduplication uses a rolling checksum, similar to the one used by the rsync algorithm, to identify unique chunks of variable size. It suits file-level deduplication because of the way files are usually modified: edits insert new data or delete existing data, and when these changes happen in the middle of a file, the offsets of all data that follows shift. Variable-size deduplication can still detect unmodified data even when its offset has changed. One major drawback of this approach is that it is CPU intensive, as the rolling checksum must pass through the entire file to find unmodified data. Since the weak rolling checksum is prone to false positives, matches are usually verified with a strong hash such as SHA-1 or SHA-256. Most “dedupe at source” backup applications use this technique.
- Fixed-size blocks – Well suited for block-level storage deduplication, including VM image backups, LUN snapshots, etc. The assumption here is that block offsets and sizes are fixed, so the deduplication process computes a digital signature such as SHA-1 or SHA-256 for each block and indexes the block by its signature. Compared to variable-size deduplication, it only requires computing the SHA and indexing blocks by it, so it is computationally more efficient. ZFS, an open-source file system, uses fixed-size deduplication. (Both approaches are sketched after this list.)
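To make the two approaches concrete, here is a minimal sketch in Python. The gear-style rolling hash, the 4 KiB block size, the boundary mask, and the use of SHA-256 are illustrative choices for the sketch, not what any particular product implements.

```python
import hashlib
import random

# A 256-entry table of pseudo-random 32-bit values for a gear-style
# rolling hash (seeded so results are reproducible).
_rng = random.Random(42)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def fixed_size_chunks(data: bytes, block_size: int = 4096):
    """Fixed-size deduplication: split at fixed offsets and identify
    each block by its SHA-256 digest."""
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        yield hashlib.sha256(block).hexdigest(), block


def variable_size_chunks(data: bytes, mask: int = 0x1FFF, min_size: int = 2048):
    """Content-defined (variable-size) chunking: a gear-style rolling
    hash declares a chunk boundary whenever its low bits are all zero,
    giving an average chunk of roughly mask + 1 bytes."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (i - start >= min_size and (h & mask) == 0) or i == len(data) - 1:
            chunk = data[start:i + 1]
            # A strong hash identifies the chunk; the weak rolling hash
            # is only used to pick boundaries.
            yield hashlib.sha256(chunk).hexdigest(), chunk
            start, h = i + 1, 0
```

Inserting a byte in the middle of the input shifts every fixed-size block after the edit and changes their hashes, while the content-defined chunker produces the same boundaries, and hence the same SHAs, for everything outside the edited region.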
There are three main components in any deduplication file system (a toy model of all three follows this list):
- A chunk store, where all unique data blocks are stored
- A lookup index, which maps the SHA of a data block to its location in the chunk store
- File metadata, which maps file offsets and lengths to the sequence of SHAs identifying the data blocks in the chunk store
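A toy, in-memory model of how the three components fit together, assuming fixed-size blocks; the class and method names are made up for illustration.

```python
import hashlib


class DedupeStore:
    """Toy model of the three dedup components: a chunk store, a
    SHA -> location index, and per-file metadata mapping each file to
    its ordered sequence of SHAs."""

    def __init__(self):
        self.chunk_store = []   # unique data blocks, addressed by position
        self.index = {}         # SHA-256 hex digest -> position in chunk_store
        self.files = {}         # file name -> ordered list of SHAs

    def write_file(self, name, blocks):
        shas = []
        for block in blocks:
            sha = hashlib.sha256(block).hexdigest()
            if sha not in self.index:          # store only the first occurrence
                self.index[sha] = len(self.chunk_store)
                self.chunk_store.append(block)
            shas.append(sha)
        self.files[name] = shas

    def read_file(self, name):
        # Reassemble the file by following its SHA sequence through the
        # index into the chunk store.
        return b"".join(self.chunk_store[self.index[sha]]
                        for sha in self.files[name])
```

Writing two files that share blocks stores each shared block once; reading a file walks its SHA sequence through the index back into the chunk store.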
Algorithms for implementing dedupe file systems have evolved over the last few years, and data deduplication has now become a mainstream feature of storage appliances.
Deduplication vs. Compression
Though in theory both methods eliminate redundant data in a data set, compression is best known for removing redundancy within a file, or within a group of files archived together into a single file, and it is most efficient for smaller data sets with redundant data. Data deduplication eliminates redundant data across a storage system that comprises large data sets, which can span a number of file systems or groups of storage devices. Data deduplication is therefore more efficient for larger data sets.
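A toy illustration of the difference, assuming two synthetic 1 MiB images that differ in a single 4 KiB block (the sizes and data are made up for the example):

```python
import hashlib
import os
import zlib

# Two 1 MiB "VM images" that differ only in their last 4 KiB block.
base = os.urandom(1024 * 1024)
image_a = base
image_b = base[:-4096] + b"\xff" * 4096

# Compression looks inside each file: random data has no internal
# redundancy, so zlib gains almost nothing here.
compressed = len(zlib.compress(image_a)) + len(zlib.compress(image_b))

# Deduplication looks across the data set: every 4 KiB block shared by
# the two images is stored only once.
unique_blocks = {hashlib.sha256(img[off:off + 4096]).digest()
                 for img in (image_a, image_b)
                 for off in range(0, len(img), 4096)}
deduplicated = len(unique_blocks) * 4096

print(f"compressed: ~{compressed} bytes, deduplicated: ~{deduplicated} bytes")
```

zlib finds little to remove inside each (random) image, while block-level deduplication stores the blocks shared across the data set only once.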
Scale-Out File Systems
Over the last few years, the explosion of big data from social media and other sources has given rise to highly scalable applications. Though these applications have built-in redundancy to protect against hardware failures, backup and DR remain essential parts of a comprehensive data-protection strategy. However, traditional backup systems are either single-node or two-node systems with limited scalability and capacity and are not suitable for the backup needs of these applications. New backup systems must be built on the same scale-out principles, and one of the critical features of this new class of backup systems is support for deduplication.
Though the algorithms for implementing deduplication in traditional file systems are well documented, algorithms for implementing it in scale-out file systems are less well known. In this blog, I will lay out the challenges in implementing deduplication in scale-out file systems.
Let’s first discuss the essential attributes of scale-out file systems and then the challenges involved in implementing deduplication in them.
Horizontal Scale
Scale-out file systems consist of loosely coupled nodes with data distributed across all of them. Data is distributed uniformly among the nodes so that each node’s physical resources, such as memory, I/O channels, and disks, are utilized equally as performance scales.
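One common way to get such a uniform, directory-free placement is to hash a data item’s identifier against the node list; a minimal sketch using rendezvous (highest-random-weight) hashing, with illustrative node names:

```python
import hashlib


def owner_node(key: str, nodes: list) -> str:
    """Rendezvous hashing: every node gets a pseudo-random score for the
    key, and the highest score wins.  Keys spread uniformly across the
    nodes, and any node can compute the answer without a central
    directory."""
    def score(node):
        return hashlib.sha256(f"{node}:{key}".encode()).digest()
    return max(nodes, key=score)


nodes = ["node-1", "node-2", "node-3", "node-4"]
print(owner_node("block-0042", nodes))   # every node computes the same answer
```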
Data Rebalancing
Data distribution in scale-out file systems can become skewed over time as new nodes are added, existing nodes are retired, data is deleted, or nodes temporarily go offline and come back online. To make the distribution uniform again, these file systems regularly rebalance data between the nodes.
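A sketch of what a rebalance has to compute when membership changes, reusing the rendezvous placement from the previous sketch (node names and key format are again illustrative):

```python
import hashlib


def owner_node(key: str, nodes: list) -> str:
    """Same rendezvous-hash placement as in the previous sketch."""
    return max(nodes,
               key=lambda n: hashlib.sha256(f"{n}:{key}".encode()).digest())


def rebalance_plan(keys, old_nodes, new_nodes):
    """List every key whose owner changes after a membership change,
    along with its old and new owner.  With rendezvous hashing only
    about 1/N of the keys move when a node is added, so rebalancing
    traffic stays proportional to the size of the change."""
    return [(k, owner_node(k, old_nodes), owner_node(k, new_nodes))
            for k in keys
            if owner_node(k, old_nodes) != owner_node(k, new_nodes)]


keys = [f"block-{i:04d}" for i in range(10_000)]
plan = rebalance_plan(keys, ["node-1", "node-2", "node-3"],
                      ["node-1", "node-2", "node-3", "node-4"])
print(f"{len(plan)} of {len(keys)} blocks need to migrate")
```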
Replication
Scale-out file systems are meant to grow large, in fact very large. Some deployments may scale to hundreds of nodes, and given the MTBF of a typical hard disk, it is statistically possible to lose a disk every few days. To cope with these failures, these file systems have built-in replication to maintain at least one accessible copy of the data in the face of a failure.
So a deduplication design for these file systems must adhere to the same essential attributes:
Horizontal Scale
The indexes, chunk stores, and file metadata of the deduplication layer should be uniformly distributed across all nodes of the file system. Usually, the index and chunk store are co-located on the same node so that chunk lookups are local and do not span multiple nodes. File metadata can be uniformly spread across nodes for optimal performance.
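A sketch of routing a chunk by its content hash so that the index entry and the chunk land on the same node. A simple modulo over the node list is used here for brevity; in practice a consistent or rendezvous scheme like the one sketched earlier keeps rebalancing cheap. The node names and function names are illustrative.

```python
import hashlib

NODES = ["node-1", "node-2", "node-3", "node-4"]   # illustrative cluster


def chunk_owner(sha_hex: str) -> str:
    """Route a chunk to a node by its content hash, so the index entry
    and the chunk itself live on the same node and a lookup never spans
    nodes."""
    return NODES[int(sha_hex, 16) % len(NODES)]


def write_chunk(data: bytes):
    sha = hashlib.sha256(data).hexdigest()
    node = chunk_owner(sha)
    # On `node`: check the local index; store the chunk only if it is new.
    # (The RPC to the owner node is elided; this is a placement sketch only.)
    return sha, node
```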
Data Rebalancing
Indexes, chunk stores, and file metadata need to be periodically migrated between nodes to maintain a uniform storage distribution.
Replication
Replication is the trickiest attribute of all. After all, deduplication is about eliminating redundant data, which is at odds with replication; yet unique data must still be replicated for high availability. There are two ways this can be achieved:
- Make two distinct file systems and replicate files between these two file systems. Each file system has deduplication built in so each file system stores files efficiently.
- Index and chunk stores keep multiple copies of each chunk on multiple nodes.
The first option is easier to manage because it leverages existing file system replication to achieve multiple copies of chunk stores for high availability.
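For the second option, here is a minimal sketch of picking the replica set for a chunk, assuming rendezvous-style placement and a replication factor of two (both are illustrative choices):

```python
import hashlib

NODES = ["node-1", "node-2", "node-3", "node-4", "node-5"]
REPLICAS = 2   # keep two copies of every unique chunk


def replica_nodes(sha_hex: str, nodes=NODES, r=REPLICAS):
    """Pick r distinct owner nodes for a chunk by ranking all nodes with
    rendezvous hashing on the chunk's SHA.  The chunk and its index
    entry are written to every node in the returned list, so one copy
    survives any single-node failure."""
    ranked = sorted(nodes,
                    key=lambda n: hashlib.sha256(f"{n}:{sha_hex}".encode()).digest(),
                    reverse=True)
    return ranked[:r]


sha = hashlib.sha256(b"some unique block").hexdigest()
print(replica_nodes(sha))   # e.g. ['node-3', 'node-1']
```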
Other considerations when implementing deduplication in these architectures are:
Garbage Collection
Each unique data chunk in a dedupe file system can be referenced by multiple files, so it is important to know how many outstanding references exist on a chunk before it can be deleted. Traditional file systems such as ZFS keep a reference count on each unique data block and delete the block when the count drops to zero. Scale-out architectures usually spread the metadata across nodes, and there is no single place to keep a per-block reference count, so periodic garbage collection of unreferenced data blocks must be performed to reclaim storage.
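A minimal mark-and-sweep sketch over dictionary-based toy structures (the parameter shapes are described in the docstring); in a real scale-out system the mark phase would run on every node against its local file metadata and the resulting live sets would be merged before sweeping:

```python
def garbage_collect(files: dict, chunk_index: dict, chunk_store: dict):
    """Mark-and-sweep garbage collection for a dedup store that keeps
    no per-chunk reference counts.

    files:       file name -> ordered list of chunk SHAs (file metadata)
    chunk_index: SHA -> location of the chunk
    chunk_store: location -> chunk data
    """
    # Mark: walk all file metadata and collect every referenced SHA.
    live = {sha for shas in files.values() for sha in shas}

    # Sweep: drop every chunk whose SHA is no longer referenced.
    for sha in list(chunk_index):
        if sha not in live:
            location = chunk_index.pop(sha)
            chunk_store.pop(location, None)
```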
Global Journaling
Dedupe file systems often employ journals for updates to be durable because each update touches more metadata than traditional file systems. For example, appending to a file requires adding SHA to file metadata, creating an index, and adding the data to the chunk store. If the system crashes in the middle of these updates, it leaves the file system in an inconsistent state. As with any scale-out architectures, the journal has to be global so any surviving node can replay the journal entries and bring back file system data structures to a consistent state again.
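A sketch of journaling an append and replaying it after a crash; the record format, sequence numbers, and function names are invented for the example:

```python
import json
from itertools import count

_seq = count()   # monotonically increasing record ids (illustrative)


def journal_append(journal, file_name, sha, chunk_location):
    """Write one durable journal record describing all three metadata
    updates an append needs, *before* applying any of them."""
    record = {"seq": next(_seq), "op": "append", "file": file_name,
              "sha": sha, "location": chunk_location}
    journal.write(json.dumps(record) + "\n")
    journal.flush()   # a real system would fsync and/or replicate the record
    return record


def replay(journal_lines, files, chunk_index, applied):
    """After a crash, any surviving node re-applies journal records to
    bring the file metadata and chunk index back to a consistent state.
    `applied` holds already-applied sequence numbers, so replay is
    idempotent even if records are seen more than once."""
    for line in journal_lines:
        rec = json.loads(line)
        if rec["seq"] in applied:
            continue
        if rec["op"] == "append":
            chunk_index.setdefault(rec["sha"], rec["location"])
            files.setdefault(rec["file"], []).append(rec["sha"])
        applied.add(rec["seq"])
```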