Has anyone ever experienced really bad performance while rebuilding a parity disk? We had a parity disk fail on our FAS940, and while it was rebuilding, disk I/O utilization went to 100% (observed with sysstat). Reads and writes did not appear high, but I don't think rebuild traffic affects those numbers.
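For what it's worth, this is all I was watching at the time (sysstat -x is the extended form; as far as I know the "Disk util" column reports the single busiest disk, which would explain why the reconstruct pegs it without showing up under reads and writes):

    sysstat -x 1

One-second intervals; the Disk util column was the one sitting at 100%.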
During this time, "ls" was taking between 30 and 60 seconds (unacceptable). We were sure this couldn't be normal, since we've had disks fail and rebuild many times before. The difference this time is that it was a parity disk, but I don't think that should make a difference other than the rebuild taking a little longer. Sure enough, the moment the rebuild completed it started working again. Any opinions?
Jerry
Hello Jerry
I need more information:
- Which ONTAP version are you using?
- With or without disk prefailing/copying?
- Did you run any "ps -z ; wait one minute ; ps -c 1" traces? Which processes kept your CPU busy?
- Did you use statit (to show the utilization and response times of every single disk) or wafl_susp (suspend reasons inside WAFL) for analysis? A minimal capture sequence is sketched below.
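For reference, this is roughly the sequence I use (assuming you can enter advanced privilege mode on your filer; the flags are from memory, so check them against your ONTAP release):

    priv set advanced
    statit -b        # begin collecting per-disk statistics
                     # ... wait one minute under the problem load ...
    statit -e        # end collection; prints utilization and response times per disk
    wafl_susp -w     # show suspend reasons inside WAFL
    priv set admin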
Basically, I too would be astonished if the reconstruct of a "parity" disk were the cause of this effect. I "kill" at least one random disk per week for demonstration/training purposes and haven't seen this effect yet (luckily :-) or sadly :-( because it would be interesting to analyze ;-) )
Regards! Dirk
Jerry wrote:
Anyone ever experience really bad performance when rebuilding a parity disk?
...
I too am astonished, but that was definitely it. The split second the rebuild finished, the problem went away. I was sitting there recalling my command history and my co-worker was running various ls commands.
I did not run any diagnostic commands such as statit. By the time we had tracked it down to being the NetApp in the first place, we were busy failing over apps to another site and scurrying around. I think we are using 6.4.1 (I'm not connected to work right now). Other volumes were not affected. Have you ever yanked a parity disk?
--- Dirk Schmiedt Dirk.Schmiedt@munich.netsurf.de wrote:
Hello Jerry
I need more information:
...
Hello Jerry
I too am astonished, but that was definitely it. The split second the rebuild finished, the problem went away. I was sitting there recalling my command history and my co-worker was running various ls commands.
Things that come into my head:
- Unfixed bug 79418: when the option raid.reconstruct.perf_impact is low, the FilerView "RAID Reconstruct Speed" is shown as high; and when the option is high, the FilerView speed is shown as low.
- Reconstructing a parity disk should be much faster than reconstructing a data disk, because the user data can be read in parallel and no out-of-band reconstruct is required; only the reconstruct itself is done sequentially.
- Maybe you hit an additional, reconstructable disk failure which forced more CPU-consuming WAFL ironing/filesystem checking?
- Maybe you had physical problems on the disk or FC layer? Command: fcadmin
- I will try to reproduce your problem next week. (A sketch of the relevant option follows below.)
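As an aside, the throttle itself is just an option; this is how I flip it on our training filers (the value names low/high below are from memory, so verify with "options raid" on your release):

    options raid.reconstruct.perf_impact low    # throttle the reconstruct in favor of client I/O
    options raid.reconstruct.perf_impact high   # favor the reconstruct over client I/O
    sysconfig -r                                # RAID status, including percent of reconstruct done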
I did not run any diagnostic commands such as statit. By the time we had tracked it down to being the NetApp in the first place, we were busy failing over apps to another site and scurrying around. I think we are using 6.4.1 (I'm not connected to work right now). Other volumes were not affected. Have you ever yanked a parity disk?
We yank everything. :-) We have 16 training filers (8 per class) and I give approximately two classes per month. We usually kill all kinds of disks on all filers. Some students kill single disks (data or parity), some force multiple disk errors by using "disk fail" or by pulling disks out physically. So we see an average of two parity failures per week. And yes, we use "hammer", "sio" and other load-generation tools. ;-)
So I can tell you that pulling out the two mailbox disks of the root volume at the same time will panic your filer even if you have RAID-DP. The filer needs a delay of about 10 seconds to activate a replacement disk and stamp it as a mailbox disk before the second mailbox disk is allowed to fail. If not, the filesystem can still be reconstructed, but the cluster-mailbox information is lost. => Spread your root volume over multiple FC loops and use SyncMirror so that you have the redundancy of 4 mailbox disks for 99.99...% high availability.
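In class we usually simulate the failure in software instead of pulling drives every time (the disk name below is only an example, and "vol mirror" assumes a traditional root volume plus the syncmirror_local license; check the syntax on your release):

    disk fail 8a.16     # software-fail one disk (example disk ID)
    vol status -r       # watch the spare get pulled in and the reconstruct run
    vol mirror vol0     # mirror the root volume with SyncMirror (license required)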
Best regards Dirk
In addition, the drive was yanked before it actually failed. I've been told it had an LED lit (yellow, I think), although it was still working.
--- Dirk Schmiedt Dirk.Schmiedt@munich.netsurf.de wrote:
Hello Jerry
...