It seems you've experienced the high CPU yourself, so this is not just theory. Can you tell me when it happened (during the scheduled de-dup runs, or during normal writes)? We're looking to implement this pretty heavily for some of our filers, even for Tier 1 in some cases, so I'd like to know what to watch out for before we step into this...
As I understand it, deduplication works like this (rough code sketch after the list):
1) Server receives a block of data to write to disk.
2) Server computes the MD5 digest of the data in the block, which is a 128-bit value. This calculation is CPU-intensive. (A SHA digest may be used instead of MD5.)
3) Server looks up the digest in a hash table to see if there is already a block on disk with the same digest, i.e., a potential duplicate.
4) If found, server verifies that the new block and existing block are indeed identical. If so, then the server uses a reference to the existing block rather than writing the new block to disk.
5) If there is no matching block for the new block, then it is written to disk and its digest is added to the hash table.
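To make those steps concrete, here is a rough sketch in Python of what that write path could look like. The names (digest_table, dedup_write, etc.) and the in-memory dicts are just my own illustration of the idea, not NetApp's actual structures:

    import hashlib

    # Made-up stand-ins for the filer's fingerprint database and block store.
    digest_table = {}   # MD5 digest -> block number already on disk
    block_store = {}    # block number -> raw block data
    next_block = 0

    def read_block(block_no):
        """Return the raw data stored at a block number."""
        return block_store[block_no]

    def write_new_block(data):
        """Write data to a fresh block and return its block number."""
        global next_block
        block_no = next_block
        block_store[block_no] = data
        next_block += 1
        return block_no

    def dedup_write(data):
        """Steps 1-5 above: hash, look up, verify, then reference or write."""
        digest = hashlib.md5(data).digest()      # step 2: the CPU-intensive part
        existing = digest_table.get(digest)      # step 3: hash table lookup
        if existing is not None and read_block(existing) == data:
            return existing                      # step 4: reuse the existing block
        block_no = write_new_block(data)         # step 5: no match, write it out
        digest_table[digest] = block_no
        return block_no

    # Example: writing the same 4 KB block twice ends up referencing one block.
    a = dedup_write(b"x" * 4096)
    b = dedup_write(b"x" * 4096)
    assert a == b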
When first setting up deduplication, the server does not have the hash table yet, so it scans each data block on disk, computes its MD5 digest, and builds the hash table from scratch. During this scan it may discover duplicate blocks, in which case each duplicate can be freed and replaced with a reference to an identical block.
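Using the same made-up structures as the sketch above, that initial scan might look something like this (again just my illustration of the idea, not how the filer actually does it):

    import hashlib

    def initial_dedup_scan(blocks):
        """Build the digest table from blocks already on disk and note
        which blocks can be freed in favor of an identical earlier block."""
        digest_table = {}   # MD5 digest -> block number that stays on disk
        block_map = {}      # block number -> block it now references
        for block_no in sorted(blocks):
            data = blocks[block_no]
            digest = hashlib.md5(data).digest()
            existing = digest_table.get(digest)
            if existing is not None and blocks[existing] == data:
                # Duplicate found: free this block, reference the earlier one.
                block_map[block_no] = existing
            else:
                digest_table[digest] = block_no
                block_map[block_no] = block_no
        return digest_table, block_map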
MD5 is CPU-intensive, but CPUs are now so fast that this may not be an issue. Perhaps NetApp has also optimized its MD5 implementation, or uses some other digest algorithm that needs less CPU.
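If you want a rough feel for the raw MD5 cost on your own hardware (obviously not the same as the filer's CPU), a quick timing loop like this gives a ballpark throughput figure; it measures only the digest itself, not the table lookups or the I/O:

    import hashlib, os, time

    # Hash 1 GB of data in 4 KB blocks and report MD5 throughput.
    block = os.urandom(4096)
    count = (1 << 30) // 4096
    start = time.time()
    for _ in range(count):
        hashlib.md5(block).digest()
    elapsed = time.time() - start
    print("MD5 throughput: about %.0f MB/s" % (1024 / elapsed))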
Steve Losen scl@virginia.edu phone: 434-924-0640
University of Virginia ITC Unix Support