Wow! You guys are incredibly helpful. This list is quite a find indeed :-)
Here's my progress so far (yes, I know this is all very scary stuff, and I stand a good chance of having to go to tape :-):
1. Tried putting the old controller card back in, so I could run disk_fw_update_fix in an attempt to fix the spin-up problem. No dice. In order to apply the firmware update, it needs to at least see the drive, and with the old controller card, it's not even seeing the drive. It scans the other drives but doesn't apply the firmware, saying the patch is only for 118202FC drives, and mine are all 118203FC. So I'm guessing there's no reviving that old controller card; it's not a spin-up issue, but a dead controller, period.
2. So, the next logical step was to put the new controller card back on the old drive, since that at least allows the drive to wake up and see the world. Now it's flagged as a hot spare, so the best course of action seems to be to modify the label back to the raid set member it used to be before I swapped controllers. The "secret menu" has options for listing all the labels. Doing so, I can clearly see that there is one obvious missing entry, which can easily be reconstructed. There is:
...
7.16 : RAID/1009983010/408/0 100437a/0/10/38/ RAID/1009983010/408/0 100437a/0/10/38/
7.16 : RAID/1009983010/408/0 100437a/0/12/38/ RAID/1009983010/408/0 100437a/0/10/38/
...
Everything from /0/0/38/ to /0/13/38/ is there for that raid group, but no /0/11/38/, so that's clearly the missing drive. So I want to relabel 7.26 as
7.16 : RAID/1009983010/408/0 100437a/0/11/38/ RAID/1009983010/408/0 100437a/0/10/38/
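Finding the gap is trivial, but just to show my work, here's a quick Python sketch (the label strings are transcribed by hand from the listing above, and rgdn_of is just my own name for pulling the third field out of the second half):

    # Each label looks like "magic/time/gen/shutdown fsid/rgid/rgdn/total".
    labels = [
        "RAID/1009983010/408/0 100437a/0/10/38/",
        "RAID/1009983010/408/0 100437a/0/12/38/",
        # ... the rest of the entries from the listing ...
    ]

    def rgdn_of(label):
        # rgdn is the third field of the second half (fsid/rgid/rgdn/total)
        return int(label.split()[1].split("/")[2])

    seen = {rgdn_of(l) for l in labels}
    print(set(range(14)) - seen)   # entries run 0..13; with all of them
                                   # transcribed, this prints {11} -- the
                                   # drive that needs relabelling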
In the menu system that lets you edit a disk label, you have to type in the names of the fields you want to modify. When printing the labels, the headers indicate magic/time/gen/shutdown and fsid/rgid/rgdn/total, which is what I'm using as field names to try to edit the label. I'm guessing they mean something like this:
Magic    - what type the disk is, RAID or SPARE
Time     - time the raid group was created
Gen      - I don't know; some unique ID per raid group, I assume
Shutdown - shutdown time?
Fsid     - file system ID
Rgid     - raid group ID
Rgdn     - raid group disk number
Total    - total disks in the raid group
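To keep from fat-fingering the edit, I spelled out the label I want to end up with as a little Python dict (the meanings in the comments are just my guesses above, nothing authoritative):

    # Build the label string I want to type back in, per my guessed fields.
    fields = {
        "magic":    "RAID",        # disk role: RAID member vs. SPARE
        "time":     1009983010,    # raid group creation time, I assume
        "gen":      408,           # the mystery field I can't set (see below)
        "shutdown": 0,
        "fsid":     "100437a",
        "rgid":     0,
        "rgdn":     11,            # the missing slot found above
        "total":    38,
    }
    print("{magic}/{time}/{gen}/{shutdown} {fsid}/{rgid}/{rgdn}/{total}/"
          .format(**fields))
    # -> RAID/1009983010/408/0 100437a/0/11/38/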
So, I'm able to type in "magic" to set the magic, "time" to set the time, "fsid" to set the file system ID, and "rgdn" to set the raid group disk number properly. (I can't set "total", but I guess that's computed.) However, I can't seem to set the "gen"; it isn't accepted as a field name. Without setting it, my disk label shows up like this:
7.26 : RAID/1009983010/0/0 100437a/0/11/1/ RAID/1009983010/0/0 100437a/0/11/1/
So it looks like it's in its own raid group (with a total of 1 disk).
Does anyone know how to set that third field (408 in my case), known as "gen"? I think I'm close on this; I just need that one more bit of information :-)
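My working theory on "gen" (pure speculation on my part) is that it's a generation count bumped on label writes, and that at assimilation time the filer only trusts disks whose gen matches the newest in the raid group, something like:

    # Speculative sketch of how the filer *might* use gen when assembling
    # a raid group: disks is a list of (name, gen) pairs claiming membership.
    def assimilate(disks):
        newest  = max(gen for _, gen in disks)
        current = [name for name, gen in disks if gen == newest]
        stale   = [name for name, gen in disks if gen != newest]
        return current, stale

    current, stale = assimilate([("7.16", 408), ("7.17", 408), ("7.26", 0)])
    print(stale)   # -> ['7.26'] -- a gen of 0 looks stale, so it's kicked out

That would explain why my gen-0 disk ends up alone instead of rejoining the group.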
If it doesn't join the raid group happily after getting this labelled right, or the raid doesn't rebuild, then my last course of action will be to pull that drive and use the "ignore medium errors" option on a raid rebuild. If the medium errors are kept to a minimum, I should have most of the data. If not, oh well, at least I tried :-) And it's on to the lengthy tape restore...
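For anyone picturing what that fallback costs: rebuilding a missing block is just the XOR of the corresponding blocks on every surviving disk, so a medium error on a second disk kills exactly that stripe and nothing else. A toy sketch (nothing NetApp-specific):

    # Toy single-parity rebuild: the missing disk's block is the XOR of the
    # corresponding blocks on all surviving disks (data + parity).
    def rebuild_block(surviving_blocks):
        out = bytes(len(surviving_blocks[0]))
        for blk in surviving_blocks:
            if blk is None:       # medium error on a second disk:
                return None       # this stripe alone is unrecoverable
            out = bytes(a ^ b for a, b in zip(out, blk))
        return out

Presumably "ignore medium errors" just writes zeros (or skips) for the stripes that come back unrecoverable instead of aborting the whole rebuild.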
Thanks for any help...
-dale
Dale Gass wrote:
Hey, got a couple of NetApp questions; they're a little low-level, and undoubtedly unsupported and warranty-voiding, etc., but any advice would be appreciated... (We don't have a support contract, due to the costs.)
After we moved the Network Appliance filer, it woke up kind of grumpy. One of the drives (Seagate Cheetah 18G FC-AL) didn't respond at all. Normally, when powering on a shelf, the green light goes off, blinks a bit, and then comes on solid. With this drive, the light just stays on solid from the instant the switch is flipped on for the shelf. Pretty much non-responsive.
So the unit attempts a rebuild. However, during the rebuild, unrecoverable media errors are encountered on another drive in the raid set. Sigh... "File system may be scrambled."
So... With the netapp off, I try another drive in the non-responsive drive's slot. Its status light does the normal thing, so the shelf is okay; something is wrong with the drive or its controller. Given that the light doesn't blink at all upon initialization, I suspect the controller.
Since I had another spare drive of exactly the same make/model, I tried swapping its controller card onto the non-responsive unit. Doing so, then powering on the shelf, gives the normal status light sequence (blinking, then solid). A good sign so far.
Then... Powering on the netapp: it says the raid set is still missing a drive, shows the drive with the new controller as "uninitialized", assigns it as a hot spare, and then tries the rebuild again (which fails on the media errors on the other drive...).
So. I'm guessing the NetApp uses the drive's serial # (which is on the controller card, not the drive, I presume) to keep track of the drive's function. I guess my three questions are as follows:
- Is there any way to tell the NetApp that a drive's serial # has changed? (Where is this low-level raid configuration data stored? In NVRAM, I assume? I looked around the files in /vol/vol0/etc, but nothing looked appropriate.)
- Does the fact that the drive was flagged as a hot spare actually cause anything to be written to the drive, or is it just noted as such in the NetApp's configuration? (Also, since a rebuild was attempted, does that mean my data was overwritten? I guess since it was a rebuild, any data that was successfully rebuilt before the media errors should be the same as was on the drive before, right? Or not...? The drive # was 26. There was another hot spare at #24; #26 is listed first in the hot spare list on boot. Would the lower-numbered or listed-first hot spare tend to be used? The filer didn't indicate *which* hot spare it was starting to use in the rebuild.)
- Each time the filer starts up, it claims to recover different blocks on the medium-error drive, and claims they'll be reassigned. Is there any way to force the retry count higher, so it will be more determined about recovery on the medium errors? Since different sectors are successfully read at different times, one would think that if it did the reassignment each time, it would eventually recover the drive (maybe). Conceptually, I'm imagining something like the retry loop sketched below.
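(A sketch only; read_sector is a hypothetical raw-read primitive standing in for whatever the filer actually does, returning None on a medium error:)

    import time

    def stubborn_read(read_sector, lba, attempts=100, delay=0.1):
        # Retry a flaky sector: if it reads *sometimes*, enough attempts
        # should eventually capture it so the block can be reassigned.
        for _ in range(attempts):
            data = read_sector(lba)   # hypothetical raw-read primitive
            if data is not None:      # None == medium error this pass
                return data
            time.sleep(delay)
        return None                   # genuinely unreadable

    # Sweep the bad region, keeping whatever eventually reads clean:
    # recovered = {lba: stubborn_read(dev_read, lba) for lba in bad_lbas}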
Dual failures really suck on raid. I'm hoping that there's a way to bring this back to life.
(We do have the data on tape; but due to a number of circumstances I won't go into here, restoring it would be *very* laborious, so I'm hoping for a bit more of a creative solution :-)
Thanks, all...
-dale