----- Original Message -----
From: "Mark D Fowle" <Fowle_Mark_D@CAT.com>
To: "toasters" <toasters@mathworks.com>
Sent: Wednesday, May 03, 2000 4:00 AM
Subject: raid failure
I have heard a few horror stories lately about netapps and multi-disk raid failures.
I have heard a lot more horror stories regarding single-disk failures in non-raid situations. That's the whole point of getting raid... to move that pain from the chance of a single-disk failure to the much smaller chance of a double-disk failure.
Has anyone out there experienced this, and what did you do for recovery?
I have. If the second disk failure is simply a bad block on another drive, Netapp can sometimes produce a patch that will "skip" that block and allow you to continue to use the filesystem. Then, after it's done rebuilding from the first failed disk, you can fail and rebuild the second failed disk. But this can be time-consuming, since you'll have to recheck your filesystem for possible corruption, and even after you fix that you may still have some corrupted data files somewhere that you don't know about.
Generally, in the double-disk failure case, as with every other RAID vendor (unless you're running something like a +1 configuration), you lose whatever is in that raid group/filesystem and you have to restore it from a mirrored copy or tape backup.
Were there any warnings?
Back before RAID scrubbing, sometimes there were no warnings. When one drive failed and the system went through every block on the other disks to rebuild the filesystem, it could hit a block that had gone bad some time before but had not been accessed since, and boom, you had a double disk failure.
Now that you have RAID scrubbing, all those blocks are checked every week or so, so you get an early indication of a possibly bad drive. The filer also provides warnings in advance if it's having trouble talking to a particular drive, so you can fail it proactively and replace it rather than waiting for it to fail (which would increase the risk that another drive fails during that time).
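To make the latent-bad-block scenario concrete, here is a toy sketch in Python. It assumes simple single-parity (XOR) protection across a stripe and marks an unreadable block as None. The names xor_blocks, rebuild_block and scrub are made up for illustration and are not anything from the filer's software, and a real scrub also verifies parity rather than just checking readability.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(stripe, failed_index):
    """Rebuild the block at failed_index from every other block in the stripe.

    stripe is a list of bytes objects (data blocks plus one parity block).
    None marks a block that cannot be read.
    """
    others = [blk for i, blk in enumerate(stripe) if i != failed_index]
    if any(blk is None for blk in others):
        # With single parity, a second unreadable block in the same stripe
        # leaves nothing to reconstruct from.
        raise IOError("second unreadable block in stripe: cannot rebuild")
    return xor_blocks(others)


def scrub(stripes):
    """Return (stripe_index, block_index) for every unreadable block found.

    Simplification: only readability is checked here, not parity consistency.
    """
    return [(s, b)
            for s, stripe in enumerate(stripes)
            for b, blk in enumerate(stripe)
            if blk is None]


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
    stripe = data + [xor_blocks(data)]

    # One disk fails: its block is rebuilt from the survivors.
    degraded = list(stripe)
    degraded[1] = None
    print(rebuild_block(degraded, 1) == data[1])

    # A latent bad block on another disk turns the rebuild into a
    # double failure for this stripe.
    degraded[2] = None
    try:
        rebuild_block(degraded, 1)
    except IOError as exc:
        print("rebuild failed:", exc)
```

The point of the scrub function is that it finds the unreadable block while the stripe can still be repaired from parity, instead of finding it in the middle of a rebuild.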
I have not had this happen and would like to do as much as possible to prevent it.
There's not much to be done to prevent it... disks have an MTBF, and the chance of two failures in the same time period is non-zero. The best thing to do is follow Netapp's directions regarding the operating environment and rack-mounting of your drives (so you don't have excessive vibration or temperatures or whatnot), watch your logs, fail drives if they appear to be going bad, and replace them ASAP. Set your raid reconstruct speed high, or disable access to that filesystem during reconstruction, so that the reconstruction finishes as quickly as possible and the time window for a second disk failure is as short as possible. Make sure your RAID scrubs run periodically and check their results. Use smaller RAID group sizes and take frequent backups so that when a failure does occur it is as limited and painless as possible. And only use disk drives from Network Appliance.
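As a rough back-of-the-envelope illustration of why a short reconstruction window and small RAID groups help, here is a sketch in Python. It assumes each surviving disk fails independently with an exponential lifetime, and the MTBF figure, group sizes and rebuild times are made-up numbers, not anything from Netapp. Real drives fail in correlated ways (same batch, same vibration, latent bad blocks), so treat the output as an optimistic lower bound at best.

```python
import math


def second_failure_probability(group_size, rebuild_hours, mtbf_hours):
    """Chance that at least one surviving disk fails during the rebuild.

    Assumes independent, exponentially distributed disk lifetimes, which
    is a simplification.
    """
    if group_size < 2:
        raise ValueError("a RAID group needs at least two disks")
    survivors = group_size - 1
    return 1.0 - math.exp(-survivors * rebuild_hours / mtbf_hours)


if __name__ == "__main__":
    MTBF_HOURS = 500_000.0  # hypothetical figure, for illustration only
    for group_size in (8, 14, 28):
        for rebuild_hours in (4.0, 24.0):
            p = second_failure_probability(group_size, rebuild_hours, MTBF_HOURS)
            print(f"{group_size:2d} disks, {rebuild_hours:4.0f} h rebuild: {p:.6f}")
```

Under this model the risk grows roughly in proportion to both the number of surviving disks and the length of the rebuild, which is the reasoning behind the advice above.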
Bruce