FC Drive failure recovery, continued... - toasters

7 Jan 2002


Thanks once again to all the incredibly helpful folks...  Here's my 
progress so far...
I edited the disk label of the drive whose controller I replaced.  It 
looks just like the others now.
So I booted the system (read-only mode, thanks for that tip! :-), and the 
volume is actually online and viewable, which is a major step forward.
But I'm getting a lot of errors about inconsistent directory entries:
Mon Jan  7 09:36:45 AST [/e3]: wafl_nfs_lookup: inconsistent directory 
entry {x20 0 8166461 156921645 16794490} <th.100x100.26.jpg> in {x20 0
15140401 133992804 16794490}.
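
To get a feel for how widespread the damage is, messages like the one above can be tallied per directory. Here is a rough Python sketch. It assumes every message follows exactly the shape shown above, that the first brace group identifies the entry and the second the containing directory (that is only how the message reads, not something confirmed from NetApp documentation), and that long messages may be wrapped across lines. The function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled only on the single sample message above (an assumption).
PATTERN = re.compile(
    r"\[(?P<volume>[^\]]+)\]: wafl_nfs_lookup: inconsistent directory entry "
    r"\{(?P<entry>[^}]*)\} <(?P<name>[^>]*)> in \{(?P<parent>[^}]*)\}"
)


def parse_inconsistencies(log_text):
    """Return one dict per 'inconsistent directory entry' message."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in PATTERN.finditer(flat)]


def count_by_parent(records):
    """Count how many bad entries were reported under each parent directory."""
    return Counter(r["parent"] for r in records)
```

Feeding the console log through `parse_inconsistencies` and then `count_by_parent` would show whether the errors cluster in a few directories or are spread across the whole volume.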
I have a couple of potential courses of action right now:
- Copy all the data off that I can, regardless; this is obviously a good 
thing to do in any case.  Unfortunately, the recursive copies I attempt 
seem to hang after a short while.  Darn...
- Take my drive-with-new-controller out of the set, and let the netapp 
attempt a rebuild onto a new drive, ignoring media errors on the second 
bad drive.  I worry that this will create further corruption; but given 
that my current attempt already has corruption, it might not be worse...? 
(And I might not be able to get back to where I currently am.)  A 
cleaner (and safer) alternative might be to boot ignoring media errors, 
in read only mode, with my controller-replaced-drive out of the set.  In 
read only mode, it shouldn't rebuild the set, and ignoring media errors, 
it might be able to access the data in degraded mode (or at least let me 
view what is available with that method...)
- Let the netapp repair the inconsistencies; I'm not sure of the best way 
to proceed on this one.
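
For the first option, one way around a recursive copy that hangs is to copy file by file with a time limit, skipping anything that stalls and keeping a list of what was missed. Below is a minimal Python sketch of that idea, not a tested recovery tool. It assumes the volume is mounted on a Unix client with `cp` available, and that a stuck `cp` can actually be killed (for example a soft or interruptible NFS mount); on a hard mount the kill may not take effect and the script would stall too. Listing a damaged directory can also hang, which this does not guard against. The function name and the 30-second default are arbitrary choices for illustration.

```python
import os
import subprocess


def salvage_copy(src_root, dst_root, timeout_s=30):
    """Copy files one at a time; return (copied, failed) lists of relative paths."""
    copied, failed = [], []

    def note_walk_error(err):
        # os.walk calls this when a directory cannot be listed.
        failed.append(os.path.relpath(err.filename, src_root))

    for dirpath, _dirnames, filenames in os.walk(src_root, onerror=note_walk_error):
        rel_dir = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.normpath(os.path.join(dst_root, rel_dir)), exist_ok=True)
        for name in sorted(filenames):
            rel = os.path.normpath(os.path.join(rel_dir, name))
            src = os.path.join(src_root, rel)
            dst = os.path.join(dst_root, rel)
            try:
                # One cp per file, so a single bad file cannot stall the whole run.
                subprocess.run(
                    ["cp", "-p", "--", src, dst],
                    check=True,
                    timeout=timeout_s,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
                copied.append(rel)
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                # Read error or stall: record the file and move on.
                failed.append(rel)
    return copied, failed
```

The `failed` list doubles as a shopping list for the tape restore: anything that could not be read from the live volume is what needs to come back from tape.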
I'm working on the tape-restore method in parallel, but anything we can 
get off via some creative tweaking would be worth a try...
I think I'll try the ignore-media-errors and read-only-mode thing next, 
to see what that view of the world is like.
-dale