Geoff,
You'll need to boot with floppy boot option 29/7, which tells RAID to ignore media read errors. The reconstruct will then complete, but you'll want to run WAFL_check after that, since ignoring media errors means some blocks of data could be lost (user data or metadata).
The RAID subsystem in ONTAP was rewritten in version 6.2, and reconstruct (among many other things) is much improved. We keep track of reconstruct progress between reboots, and in even more recent versions, we won't panic when we hit a media error during reconstruct. WAFL iron will allow you to do a sanity check on the volume following this, without having to bring the filer down for WAFL_check.
Data ONTAP 6.5 features RAID-DP, or RAID double parity, which makes this problem disappear. The RAID group would then have double redundancy, and could handle a media failure during the reconstruct of a single disk without data loss or downtime. No single-parity RAID-4 or RAID-5 system can handle this without at least some data loss.
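To illustrate why single parity cannot survive a media error during a rebuild, here is a minimal sketch of XOR parity reconstruction over one stripe. It is not ONTAP code. The function names, the tiny two-byte blocks, and the use of `None` to mark an unreadable block are all assumptions made for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(stripe, failed):
    """Rebuild the block at index `failed` from every other block in the stripe.

    `stripe` holds the data blocks followed by the parity block.
    A value of None marks a block that could not be read (hypothetical convention).
    """
    survivors = [block for i, block in enumerate(stripe) if i != failed]
    if any(block is None for block in survivors):
        # Single parity gives one equation per stripe, so two unknowns are unsolvable.
        raise IOError("second unreadable block in stripe; single parity cannot recover")
    return xor_blocks(survivors)

# Three hypothetical data disks plus one parity disk.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)
stripe = data + [parity]

# Case 1: disk 0 has failed, every other block is readable.
rebuilt = reconstruct(stripe, failed=0)
print(rebuilt == data[0])

# Case 2: disk 0 has failed AND disk 1 returns a media error on this stripe.
damaged = list(stripe)
damaged[1] = None
try:
    reconstruct(damaged, failed=0)
    recovered = True
except IOError as err:
    recovered = False
    print(err)
```

With a second, independent parity block per stripe, as in a double-parity scheme, the second case would have two equations for two unknowns, which is why the rebuild could continue.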
Hope this helps.
Steve
-----Original Message-----
From: Geoff Hardin [mailto:geoff.hardin@dalsemi.com]
Sent: Tuesday, March 16, 2004 6:23 AM
To: toasters
Subject: RAID4 Catch-22
We are stuck in a RAID 4 Catch-22 and we really can't see any way out of it.
Here's the situation: I have a clustered pair of F760 filers running ONTAP 6.1R1P1. Early Monday morning, disk 6.20 (72 GB drive in a DS14 shelf) failed. The filer immediately began to rebuild on spare disk 6.29 and, because of the large size of the volume (approximately 800 GB), the reconstruction was still running at about 7 pm. At 7 pm, disk 6.25 experienced an unrecoverable read error on sector 128273044; since there was no redundancy left to recover this data from, the filer panicked and rebooted. Fast forward twelve hours and again we're about 90% done with reconstruction when disk 6.25 hits the same unrecoverable read error on sector 128273044; the filer panics, reboots, and the countdown begins again.
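As a rough check on why the panic always arrives at about 90%, the failing sector number can be converted to a position on the disk. This is a back-of-the-envelope sketch only: it assumes 512-byte sectors, a nominal decimal 72 GB capacity, and a reconstruct that walks the disk linearly from the start, none of which is stated in the message.

```python
SECTOR_BYTES = 512            # assumption: 512 bytes of data per sector
DISK_BYTES = 72 * 10**9       # assumption: nominal 72 GB, decimal
FAILING_SECTOR = 128273044    # sector number quoted in the message

offset_bytes = FAILING_SECTOR * SECTOR_BYTES
fraction = offset_bytes / DISK_BYTES

print(f"offset: {offset_bytes / 10**9:.1f} GB")
print(f"position on disk: {fraction:.0%}")
```

Under those assumptions the bad sector sits roughly 66 GB into the disk, a little over nine-tenths of the way through, which is consistent with the reconstruct reaching about 90% each time before it fails.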
Does anyone know of any way out of this dilemma, short of restoring from tape?
Many thanks in advance.
Geoff Hardin UNIX System Administrator Dallas Semiconductor geoff.hardin@dalsemi.com
Steve Strange writes: [...]
Data ONTAP 6.5 features RAID-DP, or RAID double parity, which makes this problem disappear. The RAID group would then have double redundancy, and could handle a media failure during the reconstruct of a single disk without data loss or downtime.
Can anyone elaborate on how this works, in detail?
Chris Thompson Email: cet1@cam.ac.uk