May 12, 2011
Low Cost and Effective Data Deduplication
What is it?
The latest buzzwords coming from your coworkers, vendors, and tech conferences are “cloud” and “data deduplication.” Data deduplication (dedup) is a specialized data compression technique for eliminating coarse-grained redundant data, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored, along with references to that unique copy. Because only the unique data is stored, deduplication reduces the required storage capacity.
Simply put: Software will compare a file to an original backup. If it finds the same file, it stores a metadata reference to the original backup, and deletes the redundant file.
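As a rough illustration of that idea, here is a minimal Python sketch of file-level deduplication. The `FileStore` class, its method names, and the use of SHA-256 are assumptions made for this example; they do not describe how any vendor's product is implemented.

```python
import hashlib


class FileStore:
    """Toy file-level deduplicating store (illustrative only)."""

    def __init__(self):
        self.blobs = {}    # content hash -> file bytes, stored once
        self.catalog = {}  # backup path -> content hash (the metadata reference)

    def backup(self, path, data):
        """Record a file; return True only if its content had to be stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.catalog[path] = digest
        return is_new

    def restore(self, path):
        """Follow the metadata reference back to the single stored copy."""
        return self.blobs[self.catalog[path]]


store = FileStore()
store.backup("monday/report.doc", b"quarterly numbers")
store.backup("tuesday/report.doc", b"quarterly numbers")  # identical content
print(len(store.catalog), "files cataloged,", len(store.blobs), "copy stored")
```

Both backups can be restored, but the content is held only once.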
Choices – each at ~ $50k
EMC’s Data Domain deduplicates your data in-line (see below) on the target device. Basically, it is a head-and-controller model with a massive RAID storage array.
ExaGrid also offers a head-and-controller model, but each new head that you add makes your backups “faster,” because its processors are linked into the post-process “grid.”
IBM jumped into the game with its ProtecTIER product, a similar solution that is guaranteed to work with other Big Blue hardware.
Post-process deduplication
With post-process deduplication, new data is first stored on the storage device, and a later process analyzes the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookups to complete before storing the data, so store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on “active” files, or to process files based on type and location. One potential drawback is that you may unnecessarily store duplicate data for a short time, which is an issue if the storage system is near full capacity.
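The two phases described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `disk` dictionary stands in for the storage device, and `ingest`, `post_process`, and the `("ref", name)` marker are hypothetical names invented for the example.

```python
import hashlib


def ingest(disk, name, data):
    """Write path: store the data immediately, with no hashing or lookup."""
    disk[name] = data


def post_process(disk):
    """Later pass: replace duplicates with a reference to the first copy.

    Returns the number of bytes reclaimed by this pass.
    """
    seen = {}      # content hash -> name of the copy that is kept
    reclaimed = 0
    for name in sorted(disk):
        data = disk[name]
        if isinstance(data, tuple):  # already converted to a reference
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            disk[name] = ("ref", seen[digest])
            reclaimed += len(data)
        else:
            seen[digest] = name
    return reclaimed


disk = {}
ingest(disk, "a.bak", b"x" * 1000)
ingest(disk, "b.bak", b"x" * 1000)  # duplicate sits on disk until the pass runs
ingest(disk, "c.bak", b"y" * 500)
reclaimed = post_process(disk)
print("bytes reclaimed:", reclaimed)
```

Until `post_process` runs, both copies occupy space, which is the short-lived overhead mentioned above.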
In-line deduplication
With in-line deduplication, the hash calculations are performed on the target device in real time as the data enters the device. If the device spots a block that it has already stored, it does not store the new block, just a reference to the existing block. The benefit of in-line deduplication over post-process deduplication is that it requires less storage, since duplicate data is never written. On the negative side, it is frequently argued that because hash calculations and lookups take so long, data ingestion can be slower, reducing the backup throughput of the device. However, certain vendors with in-line deduplication have demonstrated equipment with performance similar to their post-process counterparts.
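For comparison, here is a minimal sketch of the in-line approach at block level. The fixed 4096-byte block size, the `InlineStore` class, and SHA-256 are assumptions chosen for illustration; real appliances may use variable-size chunking and different hashes.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class InlineStore:
    """Toy target device that deduplicates blocks as they arrive."""

    def __init__(self):
        self.blocks = {}   # block hash -> block bytes, stored once
        self.streams = {}  # stream name -> ordered list of block hashes

    def write(self, name, data):
        """Hash each block on ingest; return how many blocks were new."""
        recipe = []
        new_blocks = 0
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:  # lookup happens before storing
                self.blocks[digest] = block
                new_blocks += 1
            recipe.append(digest)
        self.streams[name] = recipe
        return new_blocks

    def read(self, name):
        """Rebuild a stream from its list of block references."""
        return b"".join(self.blocks[d] for d in self.streams[name])


inline = InlineStore()
first = bytes(BLOCK_SIZE) + b"A" * BLOCK_SIZE
second = bytes(BLOCK_SIZE) + b"B" * BLOCK_SIZE  # shares its first block
new_first = inline.write("full", first)
new_second = inline.write("incremental", second)
print(new_first, "new blocks, then", new_second)
```

The hashing and lookup sit directly on the write path here, which is where the throughput concern comes from.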
Great! Everyone wants to save on storage and backup. How do I do it at a lower cost than the high-end vendors?
Simple solution: build it yourself!
Ingredients
- Symantec NetBackup 7.1 (or Backup Exec)
- A server with a lot of storage (24 TB IBM x3650 M3 Express)
Symantec now offers client-based data deduplication with its PureDisk product, which snaps right into your existing NetBackup and Backup Exec environments! This means that you can now dedup on whichever device you choose!
How will it work?
The deduplication hash calculations are initially performed on the source (client) machines. Files whose hashes match files already on the target device are not sent; the target device just creates the appropriate internal links to reference the duplicated data. The benefit is that data is not sent across the network unnecessarily, which reduces traffic load.
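The exchange between client and target might look roughly like the following Python sketch. It is a generic model of source-side deduplication, not PureDisk's actual protocol: the `Target` class, its `has`, `receive`, and `link` methods, and `client_backup` are all hypothetical names, and the network is simulated with direct method calls.

```python
import hashlib


class Target:
    """Toy target device reached over a simulated network."""

    def __init__(self):
        self.blobs = {}    # content hash -> file bytes
        self.catalog = {}  # backup path -> content hash

    def has(self, digest):
        return digest in self.blobs

    def receive(self, path, digest, data):
        """Full transfer: the file contents cross the network."""
        self.blobs[digest] = data
        self.catalog[path] = digest

    def link(self, path, digest):
        """Only a reference is recorded; no file contents are sent."""
        self.catalog[path] = digest


def client_backup(files, target):
    """Hash on the client and send only unknown files; return bytes sent."""
    sent = 0
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if target.has(digest):
            target.link(path, digest)
        else:
            target.receive(path, digest, data)
            sent += len(data)
    return sent


target = Target()
sent_a = client_backup(
    {"hostA/os.img": b"o" * 5000, "hostA/notes.txt": b"hello"}, target
)
sent_b = client_backup({"hostB/os.img": b"o" * 5000}, target)
print("host A sent", sent_a, "bytes; host B sent", sent_b, "bytes")
```

The second host's image is already on the target, so only its hash and path need to travel.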
Total Cost: ~ $14k server + $6k software = $20k! Savings = $30k, with far more storage!