How does data deduplication work?

Recent years have seen a boom in self-storage units. These large, warehouse-style facilities have sprung up nationwide for one reason: the average person now has more possessions than they know what to do with.

The same basic situation also plagues the world of IT. We’re in the midst of an explosion of data. Even relatively simple, everyday objects now routinely generate data on their own thanks to Internet of Things (IoT) functionality. Never before in history has so much data been created, collected and analyzed. And never before have more data managers wrestled with the problem of how to store so much data.

A company may initially fail to recognize the problem or how large it can become, and then has to find a larger storage solution. In time, the company may outgrow that storage system too, requiring even more investment. Inevitably, the company will tire of this game and seek a cheaper and simpler option, which brings us to data deduplication.

Although many organizations make use of data deduplication techniques (or “dedupe”) as part of their data management system, not nearly as many truly understand what the deduplication process is and what it’s intended to do. So, let’s demystify dedupe and explain how data deduplication works.

What does deduplication do?

First, let’s clarify our main term. Data deduplication is a process organizations use to streamline their data holdings and reduce the amount of data they’re archiving by eliminating redundant copies of data.

Furthermore, we should point out that when we speak about redundant data, we usually mean redundancy at the file level: a rampant proliferation of duplicate data files. So in many discussions of data deduplication, what is actually needed is a file deduplication system, although deduplication can also operate on smaller blocks of data within files.

What’s the main goal of deduplication?

Some people carry an incorrect notion about the nature of data, viewing it as a commodity that simply exists to be gathered and harvested—like apples off a tree from your own backyard.

The reality is that each new file of data costs money. Either it costs money to obtain such data in the first place (for example, through the purchase of data lists), or it requires substantial financial investment for an organization to gather and glean data on its own, even if it’s data that the organization itself is organically producing and collecting. Data sets, therefore, are an investment, and like any valuable investment, they must be protected rigorously.

In this instance, we’re talking about data storage space—be it in the form of on-premises hardware servers or through cloud storage via a cloud-based data center—that must be purchased or leased.

Duplicate copies of data that have undergone replication, therefore, detract from the bottom line by imposing additional storage costs beyond those associated with the primary storage system and its storage space. In short, more storage media assets must be devoted to accommodate both new data and already-stored data. At some point in a company’s trajectory, duplicate data can easily become a financial liability.

So, to sum up, the main goal of data deduplication is to save money by enabling organizations to spend less on extra storage.

Additional benefits of deduplication

There are also other reasons beyond storage capacity for companies to embrace data deduplication solutions, probably none more essential than the data protection and enhancement they provide. Workloads that run on deduplicated data can be refined and optimized to run more efficiently than those that run on data rife with duplicate files.

Another important aspect of dedupe is how it helps empower a speedy and successful disaster recovery effort and minimizes the amount of data loss that can often result from such an event. Dedupe helps enable a sturdy backup process so an organization’s backup system is equal to the task of handling its backup data. In addition to helping with full backups, dedupe also aids in retention efforts.

Still another benefit of data deduplication is how well it works in conjunction with virtual desktop infrastructure (VDI) deployments, because the virtual hard disks behind the VDI’s remote desktops are largely identical to one another. Popular Desktop as a Service (DaaS) products include Azure Virtual Desktop from Microsoft and its Windows VDI. These products rely on virtual machines (VMs), which are created during the server virtualization process and which, in turn, power the VDI technology.

Deduplication methodology

The most commonly used form of data deduplication is block deduplication. This method uses automated functions to identify duplicated blocks of data and then remove them. Working at the block level, the software analyzes chunks of data and marks the unique ones for validation and preservation. Then, when the deduplication software detects a repetition of the same data block, that repetition is removed and a reference to the original data is included in its place.
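The block-level process described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the 4,096-byte block size, the choice of SHA-256 and the names dedupe_blocks and rebuild are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes; real systems vary


def dedupe_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep each unique block once.

    Returns (store, recipe): store maps a SHA-256 digest to the block bytes,
    and recipe is the ordered list of digests needed to rebuild the data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block      # first time this block is seen
        recipe.append(digest)          # repeats become references only
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data by following the references."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    # Hypothetical input: five blocks, of which only two are distinct.
    data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
    store, recipe = dedupe_blocks(data)
    print(len(recipe), "blocks referenced,", len(store), "blocks stored")
    assert rebuild(store, recipe) == data
```

The recipe plays the role of the references mentioned above: every repeated block costs one digest entry rather than a full copy of the block.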

That’s the main form of dedupe, but hardly the only one. In other use cases, an alternative method of data deduplication operates at the file level. This method, known as single-instance storage, compares full copies of files within the file server rather than chunks or blocks of data. Like its block-level counterpart, file deduplication depends upon keeping the original file within the file system and removing extra copies.
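File-level deduplication can be sketched the same way, except that the unit being hashed is the whole file. The helper names file_digest and find_duplicate_files are hypothetical, and the sketch only reports duplicates; what a real system does with them (for example, replacing extra copies with references) is left out.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a whole file, reading it in chunks to limit memory use."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicate_files(root: Path) -> dict:
    """Group files under root by content digest; keep groups of 2 or more."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}


if __name__ == "__main__":
    # Hypothetical demo files written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "report.txt").write_bytes(b"quarterly numbers")
        (root / "report_copy.txt").write_bytes(b"quarterly numbers")
        (root / "notes.txt").write_bytes(b"something else")
        for digest, paths in find_duplicate_files(root).items():
            original, *extras = paths
            print("keep", original.name, "duplicates:", [p.name for p in extras])
```

Note that two files differing by a single byte produce different digests, so this approach finds only exact copies, which is the trade-off against block-level deduplication.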

It should be noted that deduplication techniques do not work in quite the same manner as data compression algorithms (e.g., LZ77, LZ78), although both pursue the same general goal of reducing data redundancy. Deduplication works on a larger, macro scale, replacing identical files or blocks with references to a shared copy, whereas compression algorithms encode the redundancy within a piece of data more efficiently.
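A small experiment can make the difference concrete. The sketch below uses Python’s zlib module as a stand-in for a compression algorithm and a digest lookup as a stand-in for deduplication. The payload is a made-up example: pseudo-random bytes, chosen because they contain little internal redundancy for a compressor to exploit, stored twice as two identical "files".

```python
import hashlib
import random
import zlib

# Hypothetical payload: 64 KiB of seeded pseudo-random bytes.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
copies = [payload, payload]  # two identical "files"

# Compression re-encodes each object on its own, so it cannot take
# advantage of the fact that the second file repeats the first.
compressed_total = sum(len(zlib.compress(c)) for c in copies)

# Deduplication keeps one instance per unique digest.
unique = {hashlib.sha256(c).hexdigest(): c for c in copies}
deduped_total = sum(len(c) for c in unique.values())

print("raw bytes:         ", sum(len(c) for c in copies))
print("compressed bytes:  ", compressed_total)
print("deduplicated bytes:", deduped_total)
```

In practice the two techniques are complementary rather than competing: nothing prevents a system from compressing the unique data that remains after deduplication.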

Types of data deduplication

There are different types of data deduplication depending on when the deduplication process occurs:

Inline deduplication: This form of data deduplication occurs in the moment, in real time, as data flows into the storage system. The inline dedupe system carries less data traffic because it neither transfers nor stores duplicated data. This can lead to a reduction in the total amount of bandwidth needed by that organization.
Post-process deduplication: This type of deduplication takes place after data has been written and placed on some type of storage device.

Here it’s worth explaining that both types of data deduplication are affected by the hash calculations inherent to data deduplication. These cryptographic calculations are integral to identifying repeated patterns in data. During inline deduplication, those calculations are performed in the moment, which can dominate and temporarily overwhelm computing resources. In post-process deduplication, the hash calculations can be performed at any time after the data is added, in a way and at a time that doesn’t overtax the organization’s computing resources.
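The timing difference can be shown with two toy stores that end up holding the same unique blocks but pay the hashing cost at different moments. The class names InlineStore and PostProcessStore and their methods are invented for this sketch and do not correspond to any product’s API.

```python
import hashlib


def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class InlineStore:
    """Hashes every block on the write path; duplicates never reach storage."""

    def __init__(self):
        self.blocks = {}  # digest -> block
        self.refs = []    # digests in write order

    def write(self, block: bytes) -> None:
        d = digest(block)                 # hash cost paid during the write
        self.blocks.setdefault(d, block)
        self.refs.append(d)


class PostProcessStore:
    """Writes raw blocks immediately; a later pass removes duplicates."""

    def __init__(self):
        self.raw = []     # landing area: blocks stored exactly as written
        self.blocks = {}
        self.refs = []

    def write(self, block: bytes) -> None:
        self.raw.append(block)            # no hashing on the write path

    def dedupe_pass(self) -> None:
        for block in self.raw:            # hash cost paid later
            d = digest(block)
            self.blocks.setdefault(d, block)
            self.refs.append(d)
        self.raw.clear()


if __name__ == "__main__":
    inline, post = InlineStore(), PostProcessStore()
    for block in [b"alpha", b"beta", b"alpha"]:
        inline.write(block)
        post.write(block)
    print("inline unique blocks:", len(inline.blocks))
    print("post-process raw blocks before pass:", len(post.raw))
    post.dedupe_pass()
    print("post-process unique blocks after pass:", len(post.blocks))
```

The sketch also shows the cost of deferring the work: until dedupe_pass runs, the post-process store has to hold every duplicate in its landing area.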

The subtle differences between deduplication types don’t end there. Another way to classify deduplication types is based on where such processes occur.

Source deduplication: This form of deduplication takes place near where new data is actually generated. The system scans that area and detects new copies of files, which are then removed.
Target deduplication: This type of deduplication is like an inversion of source deduplication. In target deduplication, the system deduplicates copies found somewhere other than where the original data was created, typically at the storage destination.
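One practical consequence of where deduplication happens is how much data has to travel. The sketch below assumes a simple backup scenario with a hypothetical BackupTarget class: with target deduplication every block is sent and duplicates are discarded on arrival, while with source deduplication the sender first asks which digests are missing and sends only those blocks.

```python
import hashlib


def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class BackupTarget:
    """Toy storage destination that counts the bytes it receives."""

    def __init__(self):
        self.blocks = {}
        self.bytes_received = 0

    def missing(self, digests):
        """Report which of the offered digests are not stored yet."""
        return [d for d in digests if d not in self.blocks]

    def receive(self, block: bytes) -> None:
        self.bytes_received += len(block)
        self.blocks.setdefault(digest(block), block)


def backup_with_target_dedupe(blocks, target) -> None:
    for block in blocks:                  # everything crosses the network
        target.receive(block)


def backup_with_source_dedupe(blocks, target) -> None:
    by_digest = {digest(b): b for b in blocks}   # dedupe before sending
    for d in target.missing(by_digest):
        target.receive(by_digest[d])


if __name__ == "__main__":
    blocks = [b"a" * 100, b"b" * 100, b"a" * 100]  # hypothetical data
    at_target, at_source = BackupTarget(), BackupTarget()
    backup_with_target_dedupe(blocks, at_target)
    backup_with_source_dedupe(blocks, at_source)
    print("target dedupe received bytes:", at_target.bytes_received)
    print("source dedupe received bytes:", at_source.bytes_received)
```

Both targets end up storing the same two unique blocks; the difference is that the source-side variant spends hashing effort on the sending machine in exchange for less traffic.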

Because several types of deduplication are in use, forward-looking organizations must choose carefully among them, weighing each method against the company’s particular needs.

In many use cases, an organization’s deduplication method of choice may very well come down to a variety of internal variables, such as the following:

How many and what type of data sets are being created
The organization’s primary storage system
Which virtual environments are in use
Which apps the company relies upon

Recent data deduplication developments

Like most areas of computing, data deduplication is poised to make increasing use of artificial intelligence (AI) as it continues to evolve. Dedupe will grow more sophisticated as it develops further nuances that help it find patterns of redundancy as blocks of data are scanned.

One emerging trend in dedupe is reinforcement learning. This approach uses a system of rewards and penalties to learn an optimal policy for deciding whether to keep records separate or merge them.

Another trend worth watching is the use of ensemble methods, in which different models or algorithms are used in tandem to ensure even greater accuracy within the dedupe process.
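As a rough illustration of the ensemble idea, the sketch below lets three simple similarity checks vote on whether two records are duplicates. It is a deliberate simplification: real ensemble methods typically combine trained models, whereas the three checks, the 0.6 threshold and the two-vote rule here are arbitrary assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Share of whitespace-separated tokens the two records have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def same_prefix(a: str, b: str, length: int = 4) -> float:
    """1.0 if both records start with the same few characters, else 0.0."""
    return 1.0 if a.lower()[:length] == b.lower()[:length] else 0.0


def is_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Majority vote: at least two of the three checks must agree."""
    checks = (char_similarity, token_jaccard, same_prefix)
    votes = sum(check(a, b) >= threshold for check in checks)
    return votes >= 2


if __name__ == "__main__":
    # Hypothetical records.
    print(is_duplicate("Acme Corporation Ltd", "ACME Corporation Ltd."))
    print(is_duplicate("Acme Corporation", "Globex Industries"))
```

The point of voting is that no single check has to be right every time: in the first pair the token comparison is thrown off by the trailing period, but the other two checks outvote it.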

The ongoing dilemma

The IT world is becoming increasingly fixated on the ongoing issue of data proliferation and what to do about it. Many companies are finding themselves in the awkward position of simultaneously wanting to retain all the data they have worked to amass and also wanting to stick their overflowing new data in any storage container possible, if only to get it out of the way.

While such a dilemma persists, the emphasis on data deduplication efforts will continue as organizations see dedupe as the cheaper alternative to purchasing more storage. Ultimately, although we intuitively understand that business needs data, we also know that data very often requires deduplication.

The post How does data deduplication work? appeared first on IBM Blog.