We have a fairly heavily loaded FAS960c pair that contains storage
for our University-wide email system. Most of the email storage
is NFS files with the email servers running Unix and Communigate Pro.
We are transitioning to MS Exchange, so these filers also have some
FC SAN LUNs for our emerging Exchange service.
The other day we cleaned up about a hundred NFS email inboxes, average size
about 100 MB, but a few were approaching 1 GB. We removed the files on an NFS
client and immediately after the rm command returned, we experienced a
serious performance problem on the 960s.
sysstat indicated that the CPU was pegged at near 100% while all I/O
throughput (network, disk, FC SAN) and all file ops (NFS, FCP) dropped to
almost nothing. Something monopolized the filer CPU for a minute or two, which
seriously affected all of our email servers. We had to restart them all.
I suspect that the CPU load was caused by processing related to
reclaiming disk blocks freed by the file deletes. But no blocks were
actually freed because the volume had snapshots that were newer than the
deleted files. Perhaps the number of snapshots (41) was a factor.
I opened a case with NetApp on this, but reproducing the problem would have
dire consequences for our production email systems, so we can't send them
performance metrics.
I checked bugs online on NOW and didn't find anything that seemed to apply
that wasn't marked fixed. I did see a very old bug (4157) first fixed in
DOT 5.1, where WAFL would deadlock if many large files were deleted all at
once.
I was just curious if anyone else has run into anything like this.
We are running DOT 7.2.3. In the future when we delete a lot of big
files, we'll do them one at a time, with sleeps in between.
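The one-at-a-time delete with sleeps could be scripted roughly as follows. This is only a sketch: the function name delete_slowly and the 30-second default pause are illustrative assumptions, not tested values, and we have no measurements showing how long a pause the filer actually needs.

```python
import os
import time


def delete_slowly(paths, pause_seconds=30.0, sleep=time.sleep):
    """Remove files one at a time, pausing between deletions.

    pause_seconds is a guess, not a measured value; tune it while
    watching sysstat on the filer. Returns the paths actually removed.
    """
    removed = []
    for i, path in enumerate(paths):
        if i > 0:
            # Pause before every file except the first so the filer
            # is never handed a burst of large deletes.
            sleep(pause_seconds)
        try:
            os.remove(path)
        except FileNotFoundError:
            # Already gone (e.g. removed by the mail server); skip it.
            continue
        removed.append(path)
    return removed
```

Calling delete_slowly() with a list of inbox paths on the NFS client removes them in order; the sleep argument exists only so the pause can be replaced when trying the script out.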
Steve Losen scl(a)virginia.edu phone: 434-924-0640
University of Virginia ITC Unix Support