We have a fairly heavily loaded FAS960c pair that contains storage for our University-wide email system. Most of the email storage is NFS files, with the email servers running Unix and CommuniGate Pro. We are transitioning to MS Exchange, so these filers also have some FC SAN LUNs for our emerging Exchange service.
The other day we cleaned up about a hundred NFS email inboxes. The average size was about 100 MB, but a few were approaching 1 GB. We removed the files on an NFS client, and immediately after the rm command returned we experienced a serious performance problem on the 960s.
sysstat indicated that the CPU was pegged at near 100% while all I/O throughput (network, disk, FC SAN) and all file ops (NFS, FCP) dropped to almost nothing. Something grabbed the filer CPU for a minute or two, which seriously impacted all of our email servers. We had to restart them all.
I suspect that the CPU load was caused by some processing having to do with recovering disk blocks freed by the file deletes. But no blocks were actually freed because the volume had snapshots that were newer than the deleted files. Perhaps the number of snapshots (41) was a factor.
I opened a case with NetApp on this, but reproducing the problem would have dire consequences for our production email systems, so we can't send them performance metrics.
I checked bugs online on NOW and didn't find anything that seemed to apply that wasn't marked fixed. I did see a very old bug (4157) first fixed in DOT 5.1, where WAFL would deadlock if many large files were deleted all at once.
I was just curious if anyone else has run into anything like this. We are running DOT 7.2.3. In the future when we delete a lot of big files, we'll do them one at a time, with sleeps in between.
Steve Losen scl@virginia.edu phone: 434-924-0640
University of Virginia ITC Unix Support
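The workaround mentioned above (delete the files one at a time, with sleeps in between) could be sketched roughly as follows, run from the NFS client. This is only an illustration, not something posted in the thread: the function name `delete_slowly`, the `sleep` parameter, and the 30-second default pause are all assumptions, and nobody here has stated what pause is long enough for the filer to keep up.

```python
import os
import time
from typing import Callable, Iterable, List


def delete_slowly(
    paths: Iterable[str],
    pause_seconds: float = 30.0,  # hypothetical value; tune for your filer
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Remove files one at a time, pausing after each successful deletion.

    Returns the paths that were actually removed. Files that no longer
    exist are skipped without pausing.
    """
    removed: List[str] = []
    need_pause = False
    for path in paths:
        if need_pause:
            # Give the filer time to process the previous delete.
            sleep(pause_seconds)
            need_pause = False
        try:
            os.remove(path)
        except FileNotFoundError:
            continue
        removed.append(path)
        need_pause = True
    return removed
```

Watching sysstat on the filer while a script like this runs would be one way to judge whether the pause is long enough. The `sleep` argument exists only so the pause can be replaced in a dry run or test.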
There is a known issue regarding large file deletions. I know that NetApp is actively working on this, but I cannot speak to an ETA on its fix.
The workaround at this point is "don't do that", or at least, "don't do that all at once". Yes, it does seem silly, but until they have a fix for this, that's all anyone can do. Your workaround is the "right" thing at this point.
C-
I've seen this too. It can happen with large file deletions and with many, many small file deletions. Mostly it has to do with running out of zombie processes to reap the deletes. As Chris said, your best bet is to delete slowly and cautiously.
-Blake
The bug is 90314, but the bug description doesn't have any more details than what's been posted here.
John
Ditto - I've run into that very bug. The only workaround is to delete more slowly and methodically. I will say that it's not nearly as bad in recent releases as it was in earlier ones.
Hi Stephen,
We had a similar issue some time ago. After we opened a case with NetApp, engineering eventually gave us some options to set.
Unfortunately I do not have the options at hand, but maybe you can contact your NetApp colleagues or open a case. For us the options did the job.
Best Regards
Jochen
hi,
Contact the NetApp hotline and cite case 2151576 as a reference. We had a similar problem.
The fix was to change a WAFL flag....
hth hannes