As the amount of data stored on our networks grows, so does the time it takes to back it up. That becomes a problem once the backups no longer fit in the overnight window.
One solution is to eliminate duplicate data before it is backed up. How much can you save? A lot. In some cases, the savings are better than 10-to-1, meaning that more than 90% of your data is duplicates. Eliminating these redundant files can go a long way towards speeding up the backup process. As the screen shot of Symantec’s NetBackup PureDisk shows, more than 95% of the data has been eliminated by the deduplication process, going from a backup of more than 3GB to about 150MB.
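The arithmetic behind those figures is easy to check. The short Python sketch below converts a deduplication ratio into a percentage saved and does the same for a before-and-after backup size. The function names are illustrative only and do not come from any vendor's product; sizes use decimal units (1GB = 1,000MB).

```python
def ratio_to_savings(ratio: float) -> float:
    """Convert an N-to-1 deduplication ratio into the fraction of data removed."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1.0 - 1.0 / ratio


def savings_fraction(original_bytes: float, stored_bytes: float) -> float:
    """Fraction of the original backup that deduplication removed."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 1.0 - stored_bytes / original_bytes


MB = 1000 ** 2
GB = 1000 ** 3

# A 10-to-1 ratio means nine parts in ten were duplicates.
print(f"10-to-1 ratio: {ratio_to_savings(10):.0%} saved")

# A 3GB backup reduced to 150MB.
print(f"3GB -> 150MB: {savings_fraction(3 * GB, 150 * MB):.0%} saved")
```

Note how quickly the returns flatten out: moving from 10-to-1 to 20-to-1 only takes you from 90% to 95% saved.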
Deduplication seems like a simple concept, but picking the right deduplication product isn’t. There are dozens of vendors, including:
Atempo Time Navigator
Cofio.com AIMStor
There are also all sorts of technical wrinkles to understand before you buy a deduplication product. Here is a checklist and some suggestions as you navigate these waters:
First off, where is the software agent that controls the deduplication process located? Some products put their agents at the source, meaning on each and every server that will be backed up, and others put them on the actual backup appliance. The agent has to live somewhere, and depending on your particular servers, circumstances, and IT policies, you may prefer one method or the other. Some products, like CA’s Arcserve Backup, can now work with agents in both locations.
Second, how does the deduplication appliance appear to the backup software app? Some deduplication boxes appear as a network-attached storage device, while others appear as a storage area network drive. Depending on the backup software that you already have, one of these might be better suited to your situation.
Does the deduplication agent have any granularity for particular apps or OSs? Some products can examine individual email messages, database records, or files that have changed on a particular virtual machine instance. As more and more shops make use of virtualization technology, this factor becomes increasingly important: virtual disk images can be enormous, yet they contain mostly the same common files for the operating system and underlying applications. Products with this granularity are also more useful at restore time, when you need to recover particular files from inside the virtual images.
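To see why virtual disk images deduplicate so well, consider the toy Python sketch below. It splits two made-up "images" into fixed-size chunks, stores each unique chunk once under its SHA-256 hash, and keeps a per-image list of hashes so the image can be rebuilt. This is only an illustration of the general idea: the 4KB chunk size, the function names, and the fake image contents are all assumptions, and commercial products use their own (often variable-size) chunking methods.

```python
from __future__ import annotations

import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, chosen only for this example


def dedup_store(images: dict[str, bytes], chunk_size: int = CHUNK_SIZE):
    """Store each unique chunk once; return (chunk store, per-image recipes)."""
    store: dict[str, bytes] = {}
    recipes: dict[str, list[str]] = {}
    for name, data in images.items():
        recipe = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes


def restore(name: str, store: dict[str, bytes], recipes: dict[str, list[str]]) -> bytes:
    """Rebuild an image from its list of chunk hashes."""
    return b"".join(store[digest] for digest in recipes[name])


# Two fake VM images: eight shared "operating system" chunks plus one
# chunk of data unique to each machine.
os_files = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
vm_a = os_files + b"A" * CHUNK_SIZE
vm_b = os_files + b"B" * CHUNK_SIZE

store, recipes = dedup_store({"vm_a": vm_a, "vm_b": vm_b})
raw_bytes = len(vm_a) + len(vm_b)
stored_bytes = sum(len(chunk) for chunk in store.values())
print(f"raw: {raw_bytes} bytes, stored after dedup: {stored_bytes} bytes")
```

The shared operating system chunks are stored once no matter how many images contain them, and the per-image hash lists are what make it possible to restore any one image on its own.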
Do you need special hardware or does the deduplication function come included as part of the backup software? A number of the usual backup software vendors are moving towards integrating deduplication functionality in their products. For example, enabling data deduplication functionality on both Symantec’s NetBackup 7 and Backup Exec 2010 requires only a single check mark in a pop-up box in one of their control menus.
Is deduplication happening during the live stream of backup data or does some post-processing occur? With post-processing, the backup is first staged to a hard drive set aside for this purpose, and the duplicates are removed later. If the latter, do you have enough storage capacity to hold all of your backup files, and can you add more storage as your needs grow?
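A back-of-the-envelope Python sketch makes the capacity question concrete. It assumes a deliberately simple model: a product that deduplicates during the live stream needs room only for the deduplicated data, while a post-processing product also needs a staging area big enough for the full backup. The function name and the sample figures are hypothetical, and real products have overheads this ignores.

```python
def required_capacity_gb(backup_gb: float, dedup_ratio: float, post_process: bool) -> float:
    """Rough disk space needed for one backup under the simple model above."""
    if backup_gb < 0 or dedup_ratio < 1:
        raise ValueError("backup_gb must be >= 0 and dedup_ratio must be >= 1")
    deduplicated = backup_gb / dedup_ratio
    # Post-processing stages the whole backup before duplicates are removed.
    staging = backup_gb if post_process else 0.0
    return deduplicated + staging


# Hypothetical 1,000GB nightly backup that deduplicates at 10-to-1.
for post_process in (False, True):
    label = "post-process" if post_process else "live stream"
    print(label, required_capacity_gb(1000, 10, post_process), "GB")
```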
Finally, how does the deduplication product fit into your overall storage resource management picture? Can you examine file aging reports that show, for example, which files haven’t been accessed by your users for more than 90 days? Can you understand how your storage area networks are using their disk arrays, and perhaps reconfigure them for better usage? Can you drill down and see how your particular applications are using your overall storage resources? These and other analyses are valuable if you want to manage your storage needs more effectively.
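If your product lacks a file aging report, a rough one is not hard to sketch. The Python example below walks a directory tree and lists files whose last-access time is older than a cutoff, oldest first. The function name is made up for illustration, and the result is only as good as the access times your file system records: many systems are mounted with options that update access times lazily or not at all.

```python
from __future__ import annotations

import os
import time

SECONDS_PER_DAY = 86400


def stale_files(root: str, days: int = 90, now: float | None = None):
    """Return (path, size_bytes, last_access) for files not accessed in `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_atime < cutoff:
                results.append((path, info.st_size, info.st_atime))
    # Oldest access time first.
    return sorted(results, key=lambda item: item[2])
```

Calling `stale_files` on a shared folder and adding up the sizes gives a quick estimate of how much of your backup is data nobody has touched in three months.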
Dave Strom is a freelance writer living in St. Louis and the former editor-in-chief of Network Computing magazine, DigitalLanding.com, and Tom’s Hardware.com. He has written two books and numerous articles on networking, the Internet, and IT security topics. He can be reached at [email protected] and his blog can be found at strominator.com.