It seems you've experienced the high CPU yourself, so this is not just theory. Can you tell me when it happened (during the scheduled de-dup runs, or during normal writes)? We're looking to implement this pretty heavily for some of our filers, even for Tier 1 in some cases, so I'd like to know what to watch out for before we step into this...
As I understand it, deduplication works like this (rough code sketch after the list):
1) Server receives a block of data to write to disk.
2) Server computes the MD5 digest of the data in the block, which is a 128-bit value. This calculation is CPU-intensive. (A SHA digest may be used instead of MD5.)
3) Server looks up the digest in a hash table to see if there is already a block on disk with the same digest, i.e., a potential duplicate.
4) If found, server verifies that the new block and existing block are indeed identical. If so, then the server uses a reference to the existing block rather than writing the new block to disk.
5) If there is no matching block for the new block, then it is written to disk and its digest is added to the hash table.
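To make those steps concrete, here is a rough sketch in Python of what that write path could look like. The names (digest_table, dedup_write, etc.) and the in-memory dicts are just my own illustration of the idea, not NetApp's actual structures:

    import hashlib

    # Made-up stand-ins for the filer's fingerprint database and block store.
    digest_table = {}   # MD5 digest -> block number already on disk
    block_store = {}    # block number -> raw block data
    next_block = 0

    def read_block(block_no):
        """Return the raw data stored at a block number."""
        return block_store[block_no]

    def write_new_block(data):
        """Write data to a fresh block and return its block number."""
        global next_block
        block_no = next_block
        block_store[block_no] = data
        next_block += 1
        return block_no

    def dedup_write(data):
        """Steps 1-5 above: hash, look up, verify, then reference or write."""
        digest = hashlib.md5(data).digest()      # step 2: the CPU-intensive part
        existing = digest_table.get(digest)      # step 3: hash table lookup
        if existing is not None and read_block(existing) == data:
            return existing                      # step 4: reuse the existing block
        block_no = write_new_block(data)         # step 5: no match, write it out
        digest_table[digest] = block_no
        return block_no

    # Example: writing the same 4 KB block twice ends up referencing one block.
    a = dedup_write(b"x" * 4096)
    b = dedup_write(b"x" * 4096)
    assert a == b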
When first setting up deduplication, the server does not have the hash table yet, so it scans each data block on disk, computes its MD5 digest, and builds the hash table from scratch. During this scan it may discover duplicate blocks, in which case each duplicate can be freed and replaced with a reference to an identical block.
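Using the same made-up structures as the sketch above, that initial scan might look something like this (again just my illustration of the idea, not how the filer actually does it):

    import hashlib

    def initial_dedup_scan(blocks):
        """Build the digest table from blocks already on disk and note
        which blocks can be freed in favor of an identical earlier block."""
        digest_table = {}   # MD5 digest -> block number that stays on disk
        block_map = {}      # block number -> block it now references
        for block_no in sorted(blocks):
            data = blocks[block_no]
            digest = hashlib.md5(data).digest()
            existing = digest_table.get(digest)
            if existing is not None and blocks[existing] == data:
                # Duplicate found: free this block, reference the earlier one.
                block_map[block_no] = existing
            else:
                digest_table[digest] = block_no
                block_map[block_no] = block_no
        return digest_table, block_map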
MD5 is CPU-intensive, but CPUs are now so fast that this may not be an issue. Perhaps NetApp has also optimized its MD5 implementation, or uses some other digest algorithm that needs less CPU.
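If you want a rough feel for the raw MD5 cost on your own hardware (obviously not the same as the filer's CPU), a quick timing loop like this gives a ballpark throughput figure; it measures only the digest itself, not the table lookups or the I/O:

    import hashlib, os, time

    # Hash 1 GB of data in 4 KB blocks and report MD5 throughput.
    block = os.urandom(4096)
    count = (1 << 30) // 4096
    start = time.time()
    for _ in range(count):
        hashlib.md5(block).digest()
    elapsed = time.time() - start
    print("MD5 throughput: about %.0f MB/s" % (1024 / elapsed))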
Steve Losen scl@virginia.edu phone: 434-924-0640
University of Virginia ITC Unix Support