What is your setting for raid.reconstruct.perf_impact? The default is medium. You can try setting it to low, and the ls performance might be better, but keep in mind that rebuilds might take longer.
How many disks are in your raidgroup? Keep in mind that when the filer is in rebuild mode, the number of I/Os skyrockets. If you have 8 disks in the raidgroup and one disk fails, a single I/O request for a piece of information stored in that raidgroup can cause up to 7X the amount of I/O compared to normal operation. This is not peculiar to NetApp; it is normal for RAID.
Another way to get better performance without making rebuilds take longer is to reduce the number of disks in the raidgroup.
Derek
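The "7X" figure above can be sketched as simple arithmetic. This is a rough illustration, not NetApp's implementation: it assumes a single-parity raid group in which a block on the failed disk is rebuilt by reading the matching block from every surviving disk. The function name reads_per_request is made up for the example.

```python
def reads_per_request(disks_in_group: int, target_failed: bool) -> int:
    """Disk reads needed to serve one block read in a single-parity group.

    Assumption (not from the thread): a block on a healthy disk costs one
    read, while a block on the failed disk is rebuilt from all survivors.
    """
    if disks_in_group < 2:
        raise ValueError("a parity-protected group needs at least 2 disks")
    if target_failed:
        # Every surviving disk (data and parity) must be read once.
        return disks_in_group - 1
    return 1


if __name__ == "__main__":
    # 8 disks is Derek's example; 5 disks is the group size Jerry reports.
    for n in (8, 5):
        print(n, "disks:", reads_per_request(n, True), "reads per degraded block")
```

Under these assumptions a smaller raid group pays a smaller penalty per degraded read, which is the reasoning behind the suggestion to reduce the number of disks in the group.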
-----Original Message----- From: Jerry [mailto:juanino@yahoo.com] Sent: Friday, October 15, 2004 6:55 AM To: list toasters Subject: parity drive rebuild causing ls hangs
Has anyone ever experienced really bad performance when rebuilding a parity disk? We had a parity disk fail on our FAS940, and while it was trying to rebuild, the disk i/o util went to 100% (observed with sysstat). Reads and writes did not appear high, but I don't think rebuild traffic affects those numbers.
During this time, "ls" was taking between 30 and 60 seconds (unacceptable). We thought for sure this couldn't be normal, since we've had disks fail and rebuild many times. The difference this time is it was a parity disk, but I don't think that should make a difference other than taking a little longer to rebuild. Sure enough, after the rebuild was complete it started working again. Any opinions?
Jerry
I think there are only 5 72 GB disks in that raid group. Still, I've done this with data disks many times, and the rebuild at "medium" is not really noticeable. We set it to low during the rebuild, and it still had no effect.
We're not talking about something a performance metric picked up; we're talking about a 30-60 second delay on just an ls (small disk i/o, presumably). Something wasn't right.
The only sensible explanation I found: in newer DOT versions, the system doesn't simply drop the data/parity disk from the chain and rebuild onto a spare. The reasoning, as I understand it (I ran into this problem once):
- A media error that produces a corrupted block and leaves its data unreadable is generally corrected on the fly by reassigning the block. To recover the data, the filer uses the rest of the raid group to recalculate it.
- If the system ever experiences a data/parity disk failure and a corrupted block at the same time, a whole group of files may be corrupted (you can find them in the /lost+found directory).
- To avoid this problem, NetApp added a new feature: try to read as much as possible from the failing disk while reconstructing it onto a spare.
Also consider this:
- The disk can fail because of its controller card; the disk would then drop off the FC chain randomly or disturb the FC chain.
- The disk can fail because of too many media errors on it.
In the latter case, the disk is still accessible, but the filer knows it is a bad disk because information cannot be read from or written to it efficiently and reliably.
This latter case is where the filer can continue to poll the data on the disk before marking it as broken.
Now imagine that this disk returns a "retryable" sense key after having difficulty reading the data. The filer can then have a lot of trouble polling all the still-readable data (compare the normal access time of 10 ms for a healthy disk with perhaps 15-30 seconds for a retried read of a bad block).
If the filer decides to wait for the read to succeed or fail after all retries, before serving the data, it could explain why your system was stuck at the ls prompt.
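To put rough numbers on the comparison above, here is a small sketch. It assumes the reads behind an ls are served one after another and that the filer waits out each retry; the block counts are purely hypothetical, and only the 10 ms and 15-30 second figures come from this message.

```python
def listing_time_seconds(blocks_read: int, bad_blocks: int,
                         normal_ms: float = 10.0,
                         retry_s: float = 15.0) -> float:
    """Estimate the wall-clock time of a listing that reads blocks serially.

    Assumptions (hypothetical): each good block costs normal_ms, each bad
    block costs one full retry of retry_s, and nothing runs in parallel.
    """
    if not 0 <= bad_blocks <= blocks_read:
        raise ValueError("bad_blocks must be between 0 and blocks_read")
    good_blocks = blocks_read - bad_blocks
    return good_blocks * normal_ms / 1000.0 + bad_blocks * retry_s


if __name__ == "__main__":
    # Hypothetical ls touching 20 blocks, 2 of which need a retry.
    print(listing_time_seconds(20, 0))                # all blocks healthy
    print(listing_time_seconds(20, 2))                # 15 s per retry
    print(listing_time_seconds(20, 2, retry_s=30.0))  # 30 s per retry
```

With only two retried blocks the estimate already lands in the 30-60 second range reported for ls, while the healthy case stays well under a second.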
This situation is worth investigating and could be recorded as a bug. Do we prefer a system that hangs or is heavily impacted during the reconstruction (about 5 hours at least), or do we prefer to gamble on the 1/1000 chance of a bad block read while a disk is broken by media errors?
I think a middle-ground solution can be found.