Geoff,

You'll need to boot with floppy boot option 29/7, which tells RAID to
ignore media read errors. The reconstruct will then complete, but
you'll want to run WAFL_check after that, since ignoring media errors
means some blocks of data could be lost (user data or metadata).

The RAID subsystem in ONTAP was rewritten in version 6.2, and
reconstruct (among many other things) is much improved. We keep track
of reconstruct progress between reboots, and in even more recent
versions, we won't panic when we hit a media error during reconstruct.
WAFL iron will then let you run a sanity check on the volume without
having to bring the filer down for WAFL_check.

Data ONTAP 6.5 features RAID-DP, or RAID double parity, which makes this
problem disappear. The RAID group would then have double redundancy,
and could handle a media failure during the reconstruct of a single disk
without data loss or downtime. No RAID-4 or RAID-5 system can handle
a media error during reconstruct without at least some data loss.
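
To illustrate why single parity can't cover this case, here is a toy sketch in Python of XOR-parity reconstruction. It is not ONTAP code: the names xor_blocks and reconstruct_stripe and the two-byte "blocks" are made up for the example, and RAID-DP's second parity is not modelled. It only shows that rebuilding a failed disk's block needs every other block in the stripe to be readable.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_stripe(surviving):
    """Rebuild the single missing block of a stripe under single parity.

    `surviving` holds the remaining data blocks plus the parity block.
    None marks a block that could not be read (a media error).
    Returns None when the stripe cannot be rebuilt.
    """
    if any(block is None for block in surviving):
        # Two blocks are now missing from the stripe; one parity block
        # gives one equation, which cannot recover two unknowns.
        return None
    return xor_blocks(surviving)


# Hypothetical stripe: three data disks and one parity disk.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)

# Disk 0 has failed; the other disks read cleanly, so the block comes back.
rebuilt = reconstruct_stripe([data[1], data[2], parity])

# Same failure, but disk 1 returns a media error on this stripe.
lost = reconstruct_stripe([None, data[2], parity])
```

In the first case rebuilt equals the original block from disk 0. In the second, lost is None: that stripe is gone, which is the block that ignoring media errors gives up on and WAFL_check then has to clean up after.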

Hope this helps.

Steve

-----Original Message-----
From: Geoff Hardin [mailto:geoff.hardin@dalsemi.com]
Sent: Tuesday, March 16, 2004 6:23 AM
To: toasters
Subject: RAID4 Catch-22

We are stuck in a RAID 4 Catch-22 and we really can't see any way out of
it.

Here's the situation: I have a clustered pair of F760 filers running
ONTAP 6.1R1P1. Early Monday morning, disk 6.20 (72 GB drive in a DS14
shelf) failed. Immediately, the filer began to rebuild on spare disk
6.29 and (because of the large size of the volume, approximately 800
GB), the reconstruction was still running at about 7 pm. It was at 7 pm
that disk 6.25 experienced an unrecoverable read error on sector
128273044; however, since there is no parity disk available, the filer
was unable to recover this data and the filer panicked and rebooted.

Fast forward twelve hours and again we're about 90% done with
reconstruction and disk 6.25 has an unrecovered read error on sector
128273044; the filer panics, reboots, and the countdown begins again.

Does anyone know of any way out of this dilemma, short of restoring from
tape?

Many thanks in advance.

Geoff Hardin
UNIX System Administrator
Dallas Semiconductor
geoff.hardin(a)dalsemi.com