The only sensible explanation I have found:
In newer Data ONTAP versions, the system doesn't simply take the failing data/parity disk out of the chain and rebuild onto a spare.
Here is why (I ran into this problem once):
- A media error that corrupts a block and leaves its data unreadable is generally corrected on the fly by reassigning the block.
  To recover the data, the filer uses the rest of the RAID group to recalculate it.
- If the system ever experiences a data/parity disk failure and a corrupted block at the same time, a whole group of files may be corrupted
  (you can find them in the /lost+found directory).
- To avoid this problem, NetApp added a new feature: read as much as possible from the failing disk while reconstructing it onto a spare.
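
To illustrate the "recalculate from the RAID group" step, here is a minimal Python sketch. It assumes plain single-parity XOR (RAID 4 style); the block values and the xor_blocks helper are made up for the example, and this is not NetApp's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte blocks on three data disks (illustrative values only).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)  # what the parity disk would hold for this stripe

# One block unreadable: XOR of the surviving blocks and parity recovers it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]

# Two blocks unreadable in the same stripe: what is left only yields the XOR
# of the two missing blocks, not either block itself.
leftover = xor_blocks([data[2], parity])
assert leftover == xor_blocks([data[0], data[1]])
assert leftover != data[0] and leftover != data[1]
```

The second case is the "disk failure plus corrupted block at the same time" scenario: with single parity there is nothing left to rebuild the stripe from.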

Also consider this:
- The disk can fail because of its controller card; the disk would then drop off the FC chain at random or disturb the FC chain.
- The disk can fail because of too many media errors on it.
In the latter case, the disk is still accessible, but the filer knows it is a bad disk because data can no longer be read from or written to it efficiently and reliably.

This latter case is where the filer can keep reading data from the disk before marking it as broken.

Now imagine that this disk returns a "retryable" sense key after having difficulty reading the data.
The filer can then have a lot of trouble reading all the still-readable data (compare the normal access time of about 10 ms for a healthy disk with perhaps 15-30 seconds for the retries of a bad block read).

If the filer decides to wait for the read to succeed or fail after all retries, before serving the data, that could explain why your system was stuck at the ls prompt.
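
To put rough numbers on that, here is a small Python sketch using the figures above (10 ms for a healthy read, 15 s as the low end of a retried read). The function name, the purely serial read model, and the "1 bad read in 1000" ratio are my own assumptions for illustration, not how the filer actually schedules I/O.

```python
def serial_read_time_ms(good_reads, bad_reads, ok_ms=10, retry_ms=15_000):
    """Time to issue reads one after another, waiting out every retry."""
    return good_reads * ok_ms + bad_reads * retry_ms

# A single read that lands on a bad block, versus a single healthy read.
one_bad = serial_read_time_ms(good_reads=0, bad_reads=1)   # 15000 ms
one_ok = serial_read_time_ms(good_reads=1, bad_reads=0)    # 10 ms
slowdown = one_bad // one_ok                                # 1500

# 1000 reads, all healthy, versus 1000 reads where one needs retries.
healthy = serial_read_time_ms(good_reads=1000, bad_reads=0)  # 10000 ms
mixed = serial_read_time_ms(good_reads=999, bad_reads=1)     # 24990 ms
```

Under these assumptions, one or two retried reads behind a single ls would already be enough to land in the 30-60 second range.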

This situation is worth investigating and could be recorded as a bug.
Do we prefer a system that hangs or is heavily impacted during the reconstruction (about 5 hours at least), or do we prefer to gamble on the 1/1000 chance of hitting a bad block read while a disk broken by "media errors" is being rebuilt?

I think a middle-ground solution can be found.

Jerry wrote:
I think there are only 5 72g disks in that raid group.
 Still, I've done this with data disks many times, and
the rebuild at "medium" is not really noticeable.  We
set it to low during the rebuild, still no effect.

We're not talking about something a performance metric
picked up, we're talking about 30-60 sec for just an
ls (small disk i/o presumably).  Something wasn't
right.

--- Derek Lai <Derek.Lai@onyxco.com> wrote:

  
What is your setting on
raid.reconstruct.perf_impact? The default is set to
medium. You can try to set it to low and the ls
performance might be better.
But keep in mind the rebuilds might take longer.

How many disks are in your raidgroup? Keep in mind
that when the filer is in
rebuild mode, the number of I/O skyrockets. If you
have 8 disks in the
raidgroup and one disk fails, any single I/O request
for a piece of
information stored in that raidgroup is going to
cause 7X the amount of I/O
compared to normal operation. This is not peculiar
to NetApp but a normal
thing for raid.
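
As a rough illustration of the 7X figure, here is a minimal Python sketch. It assumes single parity and a request for a block that lived on the failed disk; the function name is hypothetical.

```python
def reads_to_rebuild_block(disks_in_group):
    """Reads needed to rebuild one block of the failed disk:
    one from every surviving disk in the raidgroup."""
    if disks_in_group < 2:
        raise ValueError("a raidgroup needs at least 2 disks")
    return disks_in_group - 1

eight_disk_group = reads_to_rebuild_block(8)  # 7
five_disk_group = reads_to_rebuild_block(5)   # 4
```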

One other way to get better performance without
having the rebuilds take
longer is to reduce the number of disks you have in
the raidgroup.


Derek

-----Original Message-----
From: Jerry [mailto:juanino@yahoo.com]
Sent: Friday, October 15, 2004 6:55 AM
To: list toasters
Subject: parity drive rebuild causing ls hangs


Anyone ever experience really bad performance when
rebuilding a parity disk?  We had a parity disk fail
on our FAS940 and when it was trying to rebuild the
disk i/o util went to 100% (observed with sysstat). 
Reads and Writes did not appear high, but I don't
think rebuild traffic affects those numbers.

During this time, "ls" was taking between 30 and 60
seconds (unacceptable).  We thought for sure this
couldn't be normal, since we've had disks fail and
rebuild many times.  The difference this time is it
was a parity disk, but I don't think that should
make
a difference other than taking a little longer to
rebuild. Sure enough, after the rebuild was complete
it started working again.  Any opinions?

Jerry
