At one stage, I ran into this frequently with large databases and batch processing areas combined with SnapMirror across an inadequate WAN pipe. Expert internal advice and intimate knowledge of the database rebuild and batch scheduling matrix still gave us only partial knowledge of which block changes were coming from where. Various nfsstat scripting/logging helped identify which servers were committing the most NFS ops, but there was not much we could do even with that information. Managing the problem involved the snapshot deletion method, some late nights, temporary bandwidth upgrades and ultimately some disk purchases. None of it could honestly be described as prevention.
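For what it's worth, the nfsstat logging amounted to something like the rough sketch below: sample the per-client counters periodically and diff them, then report who moved the most ops in each interval. The filer name is made up, per-client stats have to be switched on first ("options nfs.per_client_stats.enable on"), and the parsing of "nfsstat -l" output here is an assumption -- the layout varies between ONTAP releases, so check what yours actually prints.

    #!/usr/bin/env python
    # Sketch of per-client NFS op logging against a filer.
    # Assumes per-client stats are enabled on the filer and that
    # "nfsstat -l" prints one client per line roughly as
    # "<ops> <client> ..." -- adjust the parsing to your release.
    import subprocess
    import time

    FILER = "filer1"    # hypothetical filer hostname
    INTERVAL = 300      # sample every 5 minutes

    def sample():
        """Return {client: cumulative NFS op count} from nfsstat -l."""
        out = subprocess.check_output(["rsh", FILER, "nfsstat", "-l"])
        ops = {}
        for line in out.decode().splitlines():
            fields = line.split()
            if len(fields) >= 2 and fields[0].isdigit():
                # assumed layout: "<ops> <client> ..." -- verify locally
                ops[fields[1]] = int(fields[0])
        return ops

    prev = sample()
    while True:
        time.sleep(INTERVAL)
        cur = sample()
        deltas = sorted(((cur[c] - prev.get(c, 0), c) for c in cur),
                        reverse=True)
        print(time.strftime("%H:%M"), "top NFS clients this interval:")
        for delta, client in deltas[:5]:
            print("  %8d ops  %s" % (delta, client))
        prev = cur

Diffing the cumulative counters rather than reading the absolutes is the important part; the raw totals only tell you who has been busiest since boot, not who is busy now.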
This has changed in my current environment. We currently use two tools:

DataFabric Manager - our "project" areas are split into hundreds of qtrees that can be monitored individually, so any large usage increase or decrease in a particular area can be pinpointed.

Intermine FileCensus - with this we can quantify which files and folders have been created, modified or accessed. FileCensus can even be configured to scan snapshot areas if you want to compare snapshot content against the current content.
I must note that we do not really have the same snapshot problem in the current environment, but I can see how DFM and FileCensus output could be used to help track down the "paths" that are making the largest contributions to filesystem change.
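As a rough illustration of what I mean by tracking paths (not something we actually run), a walk over an NFS mount that compares the live tree against a snapshot copy can total up changed bytes per top-level directory. The paths below are hypothetical, and note that this only sees file-level churn -- blocks still pinned by files deleted after the snapshot was taken will not show up:

    #!/usr/bin/env python
    # Sketch: compare a snapshot tree against the live tree and total
    # up changed bytes per top-level directory, to see which paths
    # contribute most to turnover. Paths below are hypothetical.
    import os

    LIVE = "/vol/vol0/myapp"                      # live filesystem
    SNAP = "/vol/vol0/.snapshot/nightly.0/myapp"  # matching snapshot

    changed = {}  # top-level dir -> bytes changed since the snapshot

    for root, dirs, files in os.walk(LIVE):
        for name in files:
            live_path = os.path.join(root, name)
            rel = os.path.relpath(live_path, LIVE)
            try:
                st = os.stat(live_path)
            except OSError:
                continue
            try:
                snap_st = os.stat(os.path.join(SNAP, rel))
                # unchanged if size and mtime match the snapshot copy
                if (st.st_size, st.st_mtime) == (snap_st.st_size,
                                                 snap_st.st_mtime):
                    continue
            except OSError:
                pass  # file is new since the snapshot
            top = rel.split(os.sep)[0]
            changed[top] = changed.get(top, 0) + st.st_size

    for top, nbytes in sorted(changed.items(), key=lambda kv: -kv[1]):
        print("%10d KB  %s" % (nbytes // 1024, top))

FileCensus pointed at a .snapshot directory gets you much the same comparison without the scripting.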
Hope I have added more answers than questions!
Aaron
-----Original Message-----
From: Brian Tao [mailto:taob@risc.org]
Sent: Wednesday, 10 December 2003 2:10 PM
To: toasters@mathworks.com
Subject: Snapshot reserve management
I think most any NetApp admin has been in this situation: you set aside a chunk of disk space for your snapshot reserve. After a week goes by, you see that the reserve is at 150% of allocation. You manually delete some snapshots until it falls back under 100%, and adjust the snap schedule. A few months go by, new applications are rolled in and old ones retire. Snapshot usage has also increased, but you are at a loss to pinpoint the exact cause of the higher data turnover rate.
What do people do to shed more light on this kind of situation? I'd love to be able to conclude "It is the files in /vol/vol0/myapp/data that are chewing up the most snapshot space" or "It is the write activity coming from NFS client myhost1 that is causing the most block turnover". I think I asked this question about five years ago and did not discover an adequate solution back then. I'm hoping someone might be able to share their expertise on this problem now. ;-)