At one stage, I ran into this frequently with large databases and batch processing areas combined with SnapMirror across an inadequate WAN pipe. Expert internal advice and intimate knowledge of the database rebuild and batch scheduling matrix still gave us only partial knowledge of which block changes were coming from where. Various nfsstat scripting/logging helped identify which servers were committing the most NFS ops, but there was not much we could do even with that information. Managing the problem involved the snapshot deletion method, some late nights, temporary bandwidth upgrades and ultimately some disk purchases. None of it could honestly be described as prevention.
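For what it's worth, the nfsstat logging amounted to something like the rough sketch below: sample the per-client counters periodically and diff them, then report who moved the most ops in each interval. The filer name is made up, per-client stats have to be switched on first ("options nfs.per_client_stats.enable on"), and the parsing of "nfsstat -l" output here is an assumption -- the layout varies between ONTAP releases, so check what yours actually prints.

    #!/usr/bin/env python
    # Sketch of per-client NFS op logging against a filer.
    # Assumes per-client stats are enabled on the filer and that
    # "nfsstat -l" prints one client per line roughly as
    # "<ops> <client> ..." -- adjust the parsing to your release.
    import subprocess
    import time

    FILER = "filer1"    # hypothetical filer hostname
    INTERVAL = 300      # sample every 5 minutes

    def sample():
        """Return {client: cumulative NFS op count} from nfsstat -l."""
        out = subprocess.check_output(["rsh", FILER, "nfsstat", "-l"])
        ops = {}
        for line in out.decode().splitlines():
            fields = line.split()
            if len(fields) >= 2 and fields[0].isdigit():
                # assumed layout: "<ops> <client> ..." -- verify locally
                ops[fields[1]] = int(fields[0])
        return ops

    prev = sample()
    while True:
        time.sleep(INTERVAL)
        cur = sample()
        deltas = sorted(((cur[c] - prev.get(c, 0), c) for c in cur),
                        reverse=True)
        print(time.strftime("%H:%M"), "top NFS clients this interval:")
        for delta, client in deltas[:5]:
            print("  %8d ops  %s" % (delta, client))
        prev = cur

Diffing the cumulative counters rather than reading the absolutes is the important part; the raw totals only tell you who has been busiest since boot, not who is busy now.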
This has changed in my current environment. We currently use two tools:

DataFabric Manager - our "project" areas are split into hundreds of qtrees that can be monitored individually, so any large usage increase or decrease in a particular area can be pinpointed.

Intermine FileCensus - with this we can quantify which files and folders have been created, modified or accessed. FileCensus can even be configured to scan snapshot areas if you want to compare snapshot content against the current content.
I must note that we do not really have the same snapshot problem in the current environment, but I can see how DFM and FileCensus output could be used to help track down the "paths" that are making the largest contributions to filesystem change.
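As a rough illustration of what I mean by tracking paths (not something we actually run), a walk over an NFS mount that compares the live tree against a snapshot copy can total up changed bytes per top-level directory. The paths below are hypothetical, and note that this only sees file-level churn -- blocks still pinned by files deleted after the snapshot was taken will not show up:

    #!/usr/bin/env python
    # Sketch: compare a snapshot tree against the live tree and total
    # up changed bytes per top-level directory, to see which paths
    # contribute most to turnover. Paths below are hypothetical.
    import os

    LIVE = "/vol/vol0/myapp"                      # live filesystem
    SNAP = "/vol/vol0/.snapshot/nightly.0/myapp"  # matching snapshot

    changed = {}  # top-level dir -> bytes changed since the snapshot

    for root, dirs, files in os.walk(LIVE):
        for name in files:
            live_path = os.path.join(root, name)
            rel = os.path.relpath(live_path, LIVE)
            try:
                st = os.stat(live_path)
            except OSError:
                continue
            try:
                snap_st = os.stat(os.path.join(SNAP, rel))
                # unchanged if size and mtime match the snapshot copy
                if (st.st_size, st.st_mtime) == (snap_st.st_size,
                                                 snap_st.st_mtime):
                    continue
            except OSError:
                pass  # file is new since the snapshot
            top = rel.split(os.sep)[0]
            changed[top] = changed.get(top, 0) + st.st_size

    for top, nbytes in sorted(changed.items(), key=lambda kv: -kv[1]):
        print("%10d KB  %s" % (nbytes // 1024, top))

FileCensus pointed at a .snapshot directory gets you much the same comparison without the scripting.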
Hope I have added more answers than questions!
Aaron
-----Original Message-----
From: Brian Tao [mailto:taob@risc.org]
Sent: Wednesday, 10 December 2003 2:10 PM
To: toasters@mathworks.com
Subject: Snapshot reserve management
I think most any NetApp admin has been in this situation: you set aside a chunk of disk space for your snapshot reserve. After a week goes by, you see that the reserve is at 150% of allocation. You manually delete some snapshots until it falls back under 100%, and adjust the snap schedule. A few months go by, new applications are rolled in and old ones retire. Snapshot usage has also increased, but you are at a loss to pinpoint the exact cause of the higher data turnover rate.
What do people do to shed more light on this kind of situation? I'd love to be able to conclude "It is the files in /vol/vol0/myapp/data that are chewing up the most snapshot space" or "It is the write activity coming from NFS client myhost1 that is causing the most block turnover". I think I asked this question about five years ago and did not discover an adequate solution back then. I'm hoping someone might be able to share their expertise on this problem now. ;-)