I think most any Netapp admin has been in this situation: you set aside a chunk of disk space for your snapshot reserve. After a week goes by, you see that the reserve is at 150% of allocation. You manually delete some snapshots until it falls back under 100%, and adjust the snap schedule. A few months go by, new applications are rolled in and old ones retire. Snapshot usage has also increased, but you are at a loss to pinpoint the exact cause of the higher data turnover rate.
What do people do to shed more light on this kind of situation? I'd love to be able to conclude "It is the files in /vol/vol0/myapp/data that are chewing up the most snapshot space" or "It is the write activity coming from NFS client myhost1 that is causing the most block turnover". I think I asked this question about five years ago and did not discover an adequate solution back then. I'm hoping someone might be able to share their expertise on this problem now. ;-)
If your filer serves CIFS clients running Office-type applications, you could think about using the following option:

options cifs.snapshot_file_folding.enable on

(by default the option is turned off). As you certainly know, when you work on an Office file (Word, Excel, ...), Windows creates a temporary copy of the document; when you save your work, it erases the original and renames the temporary file to the original name. The problem with this behaviour is that if you open a file just to change one word, the whole file is rewritten on the filer, and that has an impact on snapshots because the entire previous version of the file has to be kept by the last snapshot.

The option above is an ONTAP mechanism ("file folding") that limits this effect by trying to reuse the file's unchanged blocks from the snapshot instead of keeping duplicate copies, so snapshot usage can be kept down.
hth
Well, the only time your snapshot usage will grow is when a block is overwritten or deleted. Whenever I see a big snapshot usage increase I can usually pinpoint the reason (DBAs either overwrote or deleted a database). Simply adding a cron job like:

0 * * * * rsh filer_ip df >> df-history

... can give you some good historical data you can use for trending.
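If it helps, here is a minimal sketch of that cron approach as a small script (assuming rsh access from the admin host; the filer name and log path are placeholders):

#!/usr/bin/env python
# Rough sketch of the df-history idea above.  Run it hourly from cron on an
# admin host that has rsh access to the filer.  FILER and HISTORY are
# placeholders -- adjust for your environment.
import os
import time

FILER = "filer_ip"                      # filer hostname or IP (placeholder)
HISTORY = "/var/log/filer/df-history"   # where the samples accumulate

def main():
    # ONTAP's df output also reports the .snapshot lines, so this captures
    # snapshot reserve usage over time as well as volume usage.
    output = os.popen("rsh %s df" % FILER).read()
    f = open(HISTORY, "a")
    f.write("=== %s ===\n" % time.ctime())
    f.write(output)
    f.close()

if __name__ == "__main__":
    main()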
You might be able to learn something from comparing find runs between different snapshots and/or your active filesystem.
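Along the same lines, a quick sketch of that comparison (the two paths are examples only, as seen from an NFS client; nosnapdir has to be off so the .snapshot directory is visible):

#!/usr/bin/env python
# Walk an older snapshot copy of a tree and the live tree, and report files
# that were added, removed, or changed size -- a rough pointer to where the
# block turnover is coming from.  The paths below are examples only.
import os

SNAP = "/vol/vol0/myapp/.snapshot/hourly.0"   # older point-in-time copy (example)
LIVE = "/vol/vol0/myapp"                      # active filesystem (example)

def sizes(root):
    """Map relative path -> size in bytes for every file under root."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if ".snapshot" in dirnames:
            dirnames.remove(".snapshot")      # don't descend into snapshots of the live tree
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                result[os.path.relpath(path, root)] = os.path.getsize(path)
            except OSError:
                pass                          # file vanished mid-walk
    return result

old, new = sizes(SNAP), sizes(LIVE)
for rel in sorted(set(old) | set(new)):
    if old.get(rel) != new.get(rel):
        print("%12s -> %12s  %s" % (old.get(rel, "-"), new.get(rel, "-"), rel))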
Attached is a python script I wrote to give me an overall view into qtree/volume usage, etc. It's a little specific to my environment, but you might be able to get some use out of it. I log the output of this to a file every hour and later pass over it w/ gnuplot to display pretty qtree trending info. If I felt like doing it the right way, I'd shove all this data into a mysql database...
Sample output (anonymized a bit):
generating report at Wed Dec 10 13:33:06 2003
---] qtree usage report [------------------------------------------------------
filer  vol   qtree      Disk Usage(gb)     Inodes
-------------------------------------------------------------------------------
nas1   vol0  abc        25/35    (73.89%)  37720/-
nas1   vol0  aardvarks  0/3      (24.62%)  65191/-
nas1   vol0  night      1/3      (57.67%)  1426/-
nas1   vol0  qwe1       55/100   (55.83%)  114/-
nas1   vol0  antelope   12/100   (12.53%)  828/-
nas1   vol0  cheeta     0/10     (0.38%)   2441/-
nas1   vol0  nonuseful  0/3      (3.53%)   4814/-
nas1   vol0  qua-log    22/200   (11.25%)  91/-
nas2   vol0  test       0/20     (0.74%)   5012/-
nas2   dw2   vb1        312/400  (78.10%)  198/-
nas2   dw2   sdg-log    24/40    (60.15%)  425/-
nas2   vol1  sdfs       330/400  (82.73%)  281/-
nas2   vol1  fghfd      1/10     (16.09%)  30/-
nas2   vol1  linux      10/100   (10.79%)  31/-
nas2   vol1  windows    106/300  (35.45%)  5108823/-
nas2   vol1  test       0/20     (0.74%)   5012/-
nas2   vol1  testing    5/50     (10.66%)  47/-
nas2   vol1  blahblah   0/5      (8.01%)   41/-

---] volume usage - usable (gb) [----------------------------------------------
filer  volume       usage
nas1   /vol/vol0/   617/2036 (%30.34)
nas2   /vol/vol1/   464/1147 (%40.52)
nas2   /vol/vol0/   852/1147 (%74.33)
nas2   /vol/dw2/    336/1147 (%29.37)

---] volume usage - raw (gb) [-------------------------------------------------
filer  volume       usage
nas1   /vol/vol0/   715/2868 (%24.95)
nas2   /vol/vol1/   523/1434 (%36.50)
nas2   /vol/vol0/   956/1434 (%66.72)
nas2   /vol/dw2/    563/1434 (%39.31)

---] volume allocation totals (gb) [-------------------------------------------
nas1:vol0  1304/2036 (%64.03)
nas2:dw2   440/1147  (%38.35)
nas2:vol0  1180/1147 (%102.85)
nas2:vol1  1000/1147 (%87.16)

---] filer usage (all volumes) - usable (gb) [---------------------------------
filer  usage
nas1   617/2036  (%30.34)
nas2   1654/3441 (%48.07)

---] filer usage (all volumes) - raw (gb) [------------------------------------
filer  usage
nas1   715/2868  (%24.95)
nas2   2044/4302 (%47.51)

---] warnings [----------------------------------------------------------------
WARNING: nas2:vol0 is over-allocated by around 147 GB!
You'll need to modify the
FILERS = { 'nas1': '10.1.30.218', 'nas2': '10.1.30.219' }
line at the top of the file to reflect your filer(s). The machine it runs on needs rsh access to the filer. You'll also need to make sure you have nosnapdir OFF on all volumes.
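For anyone who just wants the general shape of it, here is a very stripped-down sketch of the same rsh-based collection (not the actual script; only the per-volume usable-space part, using the example addresses above, and df parsing that is just a starting point):

#!/usr/bin/env python
# Not the original script -- just a minimal sketch of the same idea.  It runs
# "df" on each filer over rsh and prints a usable-space summary per volume;
# qtree and raw-space reporting would be layered on in the same way.
import os

FILERS = { 'nas1': '10.1.30.218', 'nas2': '10.1.30.219' }   # example filers

def volume_usage(ip):
    """Return (volume, used_kb, total_kb) tuples parsed from 'df' output."""
    rows = []
    for line in os.popen("rsh %s df" % ip):
        fields = line.split()
        # Volume lines look roughly like: /vol/vol0/  <kbytes> <used> <avail> <cap%> ...
        if len(fields) >= 4 and fields[0].startswith("/vol/") \
                and not fields[0].endswith("/.snapshot"):
            try:
                rows.append((fields[0], int(fields[2]), int(fields[1])))
            except ValueError:
                pass          # skip headers and anything that doesn't parse
    return rows

for name, ip in sorted(FILERS.items()):
    for vol, used, total in volume_usage(ip):
        gb = 1024 * 1024.0
        print("%-6s %-14s %6.0f/%-6.0f gb (%5.2f%%)" %
              (name, vol, used / gb, total / gb, 100.0 * used / total))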
-- Antonio Varni
Of course it'd be nice if I attached the script :)
--
Antonio Varni
Systems Engineer, Technology Integration & Operations
Commission Junction
1501 Chapala Street
Santa Barbara, CA 93101
p 805.899.8934  f 805.570.6678
avarni@cj.com  www.cj.com