Re: raid failure - toasters

3 May 2000


      ----- Original Message ----- 
From: "Mark D Fowle" Fowle_Mark_D@CAT.com
To: "toasters" toasters@mathworks.com
Sent: Wednesday, May 03, 2000 4:00 AM
Subject: raid failure
...
I have heard a few horror stories lately about netapps and
multi-disk raid failures.
I have heard a lot more horror stories regarding single-disk
failures in non-raid situations.  That's the whole point of
getting raid... to move that pain from the chance of a single
disk failure to a double disk failure, thus reducing it
substantially.
...
Has anyone out there experienced this
and what did you do for recovery ?
I have.  If the second failure is simply a bad block on another
drive, Netapp can sometimes produce a patch that will "skip"
that block and allow you to continue to use the filesystem.
Once it has finished rebuilding from the first failed disk, you
can fail and rebuild the second one.  But this can be
time-consuming, since you'll have to recheck your filesystem for
possible corruption, and even after you fix that you may still
have corrupted data files somewhere that you don't know
about.
Generally, in the double-disk failure case, like every other
RAID vendor (unless you're running like a +1 configuration),
you lose whatever is in that raid group/filesystem and you have
to restore it from a mirrored copy or tape backup.
...
Were there any warnings?
Back before RAID scrubbing, sometimes there were no
warnings, because when 1 drive failed and the system went
through every block on the other disks rebuilding the filesystem,
it would hit a block that had gone bad some time before and
yet had not been accessed and boom, you had a double disk
failure.
Now that you have RAID scrubbing, all those blocks are
checked every week or so, and you will get early indications of
a possibly bad drive.  Also, the filer warns you in advance if
it's having trouble talking to a particular drive, so you can
fail it proactively and replace it rather than waiting for it
to fail (waiting increases the risk that another drive will fail
during that time).
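
To illustrate why a latent bad block turns a single failure into
a double one, here is a minimal Python sketch of single-parity
reconstruction and a scrub pass.  It is a toy model, not how the
filer implements RAID: the function names, the use of None for an
unreadable block, and the in-memory stripe layout are all
assumptions made for the example.

```python
from functools import reduce

def parity(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(stripe, failed):
    """Rebuild the block at index `failed` from every other block in the stripe.

    `stripe` is a list of data blocks plus one parity block; an unreadable
    block is represented as None (an assumption of this sketch).
    """
    survivors = [b for i, b in enumerate(stripe) if i != failed]
    if any(b is None for b in survivors):
        # A second unreadable block in the same stripe: single parity
        # cannot recover it. This is the "double disk failure" case.
        raise ValueError("second unreadable block in stripe")
    return parity(survivors)

def scrub(stripes):
    """Return the indices of stripes that are unreadable or fail the parity check."""
    bad = []
    for n, stripe in enumerate(stripes):
        # XOR of all data blocks and the parity block should be all zeros.
        if any(b is None for b in stripe) or any(parity(stripe)):
            bad.append(n)
    return bad
```

A scrub that runs while all disks are healthy finds the bad
stripe early, when reconstruct() can still repair it.  Without
the scrub, the same bad block is first noticed during a rebuild,
when reconstruct() has no way to recover it.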
...
I have not had this happen and would like to do as much
as possible to prevent it.
There's not much to be done to prevent it... disks have an MTBF,
and the chance of two failures in the same time period is
non-zero.  The best thing to do is follow Netapp's directions regarding
the operating environment and rack-mounting of your drives (so
you don't have excessive vibration or temperatures or whatnot),
watch your logs and fail drives if they appear to be going bad
and replace them ASAP.  Set your raid reconstruct speed high
or disable access to that filesystem during reconstruction to make
sure the reconstruction happens as quickly as possible to shorten
the time window for a second disk failure.  Make sure your RAID
scrubs run periodically and check their results.  Use smaller RAID
group sizes and have frequent backups so that when a failure does
occur it is as limited and painless as possible.  And only use disk
drives from Network Appliance.
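
To put rough numbers on the advice about reconstruct speed and
RAID group size, here is a small Python sketch.  It assumes that
disks fail independently with exponentially distributed lifetimes
at the rated MTBF, which is a simplification: it ignores latent
bad blocks and correlated failures (heat, vibration, a bad batch
of drives).  The function name and the example figures are
hypothetical, not Netapp numbers.

```python
import math

def p_second_failure(surviving_disks, mtbf_hours, rebuild_hours):
    """Chance that at least one surviving disk fails during the rebuild window.

    Assumes independent, exponentially distributed disk lifetimes with
    mean `mtbf_hours` (a modelling assumption, not a vendor formula).
    """
    return 1.0 - math.exp(-surviving_disks * rebuild_hours / mtbf_hours)

if __name__ == "__main__":
    mtbf = 500_000  # hypothetical rated MTBF in hours
    for disks, hours in [(13, 24), (13, 6), (6, 24), (6, 6)]:
        p = p_second_failure(disks, mtbf, hours)
        print(f"{disks} surviving disks, {hours}h rebuild: {p:.6f}")
```

In this model the risk grows roughly in proportion to both the
number of surviving disks in the group and the length of the
rebuild, which is why smaller RAID groups and faster
reconstruction both help.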
Bruce