Deleting many large files spikes filer CPU


Stephen C. Losen

25 Jan 2008

1:02 p.m.

We have a fairly heavily loaded FAS960c pair that contains storage for our University-wide email system. Most of the email storage is NFS files, with the email servers running Unix and CommuniGate Pro. We are transitioning to MS Exchange, so these filers also have some FC SAN LUNs for our emerging Exchange service.

The other day we cleaned up about a hundred NFS email inboxes, average size about 100 MB, but a few were approaching 1 GB. We removed the files on an NFS client, and immediately after the rm command returned, we experienced a serious performance problem on the 960s.

sysstat indicated that the CPU was pegged at near 100% while all I/O throughput (network, disk, FC SAN) and all file ops (NFS, FCP) dropped to almost nothing. Something grabbed the filer CPU for a minute or two, which seriously impacted all of our email servers. We had to restart them all.

I suspect that the CPU load was caused by some processing having to do with recovering disk blocks freed by the file deletes. But no blocks were actually freed because the volume had snapshots that were newer than the deleted files. Perhaps the number of snapshots (41) was a factor.

I opened a case with NetApp on this, but reproducing the problem would have dire consequences for our production email systems, so we can't send them performance metrics.

I checked bugs online on NOW and didn't find anything that seemed to apply that wasn't marked fixed. I did see a very old bug (4157) first fixed in DOT 5.1, where WAFL would deadlock if many large files were deleted all at once.

I was just curious if anyone else has run into anything like this. We are running DOT 7.2.3. In the future when we delete a lot of big files, we'll do them one at a time, with sleeps in between.

Steve Losen scl@virginia.edu phone: 434-924-0640

University of Virginia ITC Unix Support
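
The "one at a time, with sleeps in between" plan from the message above could be scripted on the NFS client roughly as follows. This is only a minimal sketch: the function name `delete_slowly` and the 30-second default pause are assumptions made for illustration, and nothing in this thread says what pause is actually long enough for the filer to catch up.

```python
import os
import time
from typing import Iterable, List


def delete_slowly(paths: Iterable[str], pause_seconds: float = 30.0) -> List[str]:
    """Remove files one at a time, sleeping after each deletion.

    The pause length is a guess (hypothetical default); tune it while
    watching filer CPU with sysstat. Returns the paths actually removed.
    """
    removed = []
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            # File is already gone, so there is nothing to wait for.
            continue
        removed.append(path)
        # Give the filer time to finish its background work
        # before the next delete is sent.
        time.sleep(pause_seconds)
    return removed
```

Calling `delete_slowly(list_of_inbox_paths, pause_seconds=60)` from a single client keeps only one large delete outstanding at a time, which is the whole point of the workaround.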


Chris Blackmor

25 Jan

2:50 p.m.

There is a known issue regarding large file deletions. I know that NetApp is actively working on this, but I cannot speak to an ETA for its fix.

The workaround at this point is "don't do that", or at least, "don't do that all at once". Yes, it does seem silly, but until they have a fix for this, that's all anyone can do. Your workaround is the "right" thing at this point.

C-

--
Chris Blackmor
Advanced Micro Devices
Phone: (512) 602-1608
Fax: (512) 602-5155
Email: chris.blackmor@amd.com

"A good horse never comes in a bad color!" (Author Unknown)

My comments are mine, and mine alone.

Blake Golliher

6:07 p.m.

I've seen this too. It can happen with large file deletions and with very many small file deletions. Mostly it has to do with running out of zombie processes to reap the deletes. As Chris said, your best bet is to delete slowly and cautiously.

-Blake
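
For the many-small-files case, pausing after every single file may be impractically slow, so one option is to delete in small batches and pause between batches. The sketch below assumes hypothetical names (`batches`, `delete_in_batches`) and placeholder values for the batch size and pause; the thread gives no figures for what rate a filer tolerates.

```python
import os
import time
from itertools import islice
from typing import Iterable, Iterator, List


def batches(paths: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(paths)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def delete_in_batches(paths: Iterable[str], batch_size: int = 100,
                      pause_seconds: float = 5.0) -> int:
    """Delete files a batch at a time, sleeping after each batch.

    batch_size and pause_seconds are placeholder values, not
    recommendations. Returns the number of files removed.
    """
    deleted = 0
    for batch in batches(paths, batch_size):
        for path in batch:
            try:
                os.remove(path)
                deleted += 1
            except FileNotFoundError:
                pass  # already gone
        # Let the filer work through this batch before sending more.
        time.sleep(pause_seconds)
    return deleted
```

Because `batches` consumes its input lazily, the path list can come from a generator such as a directory walk without building the full list in memory first.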


Clear, John

7:02 p.m.

The bug is 90314, but the bug description doesn't have any more details than what's been posted here.

John


Glenn Walker

10:25 p.m.

Ditto - I've run into that very bug. The only workaround is to delete more slowly and methodically. I will say that it's not nearly as bad in recent releases as it was in earlier ones.


Willeke, Jochen

28 Jan

10:21 a.m.

Hi Stephen,

We had a similar issue some time ago. After we opened a call with NetApp, engineering finally gave us some options to set.

Sadly I do not have the options at hand, but maybe you can contact your NetApp colleagues or open a case. For us the options did their work.

Best Regards

Jochen



Herret, Hannes

6 Feb

9:38 p.m.

Hi,

Contact the NetApp hotline and cite case 2151576 as a reference. We had a similar problem.

A fix was to change a WAFL flag....

HTH, Hannes

