Re: Replacing a FC Drive Controller Card...

3 Jan 2002

      Dale:
Sounds like you've had some rough problems.  =(
It is too bad you don't have a support contract - we've found that this is
the quickest way to get something resolved.  "Doing it yourself" can often
trigger undesirable circumstances, if you know what I mean.  In fact, I
wouldn't recommend buying a filer if the maintenance costs were not within
budget.
That having been said, it looks like what you encountered was the "9/18GB
Low-profile/Half-height spin-up issue" a.k.a. NetApp Bug #19845.  There are
extensive notes on the background and fix for this problem on the NOW site,
but I'm guessing you don't have access to that either.  If you do, the URL
is:
http://now.netapp.com/Knowledgebase/solutionarea.asp?id=3.0.867965.2561976
Basically, a known manufacturing defect with those drives makes them prone
to not spinning up after being turned off.  NetApp recommends that you avoid
turning them off at all costs.  Different ones will fail each time the
whole set is turned off and it is possible that many could be broken at
once.  Without maintenance, it's going to be difficult to get those
vulnerable drives replaced.  =(
The fix for this is to boot from ONTAP floppies release 5.3.4R3P1 (although
I believe most later releases will also have this fix) and issue a special
command.  Once booted, go into maintenance mode (option 5) and run
"disk_fw_update_fix".  Our experience is that this won't always fix all
affected drives.
Now your particular situation is more complicated because you've replaced
the disk controller card.  Maybe you could put the "bad" card back in and
try the recovery procedure?  I make no claim that this will work and it
sounds scary to me, but I'm not sure what other options you have available
at this point.
Do you have more than one spare?  If so, maybe try yanking out the one that
is currently being rebuilt onto and let it pick a different disk for
rebuilding.  Again, this sounds scary so proceed with caution.
In answering your specific questions,
1) I don't know of any command to notify the NetApp about the drive's
controller board changing - in fact, I'm relatively confident that it
doesn't care about the serial number of that board.  The contents of each
disk are managed by a set of redundant disk labels that should stay on the
drive, regardless of the controller board.  Why that isn't working for you
is a bit of a mystery to me, but then again, we've never altered the inside
of any NetApp drives.
2) Nothing is written to a hot spare (besides a disk label saying "this is
a hot spare") until the disk is added to a volume or grabbed for a
rebuild.  In your situation, that won't make a big difference because if it
has already called it a hot spare, it has probably labeled it as such.  That
could be fixed by booting from floppy and using the special disk label
editor, but without NetApp support helping you along or lots of experience
with that command, I can't recommend that as a sane idea.  As far as I
know, ONTAP always picks the next available spare of the correct size.  You
should be able to tell which one is being rebuilt upon by examining "vol
status -r" output for the volume being reconstructed.
3) I'm unaware of a way to force the media retries to be higher.  There is
a special "ignore disk errors" command in the secret boot menu, but it
could end up losing your data just as well.  You don't want to try that
without Tech Support verifying that it is a good idea based on your
situation.
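For what it's worth, here is a rough sketch of why the media errors on the
second drive stop the rebuild.  It assumes plain single-parity XOR RAID
(RAID 4 style); the function names are made up for illustration and this is
not how ONTAP itself is written.  A missing block can only be recomputed
when every other block in its stripe is readable.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving):
    """Recompute the one missing block of a stripe from all the others.

    `surviving` holds the remaining data blocks plus the parity block.
    None stands for a block that could not be read (a media error).
    Returns None when the stripe cannot be reconstructed.
    """
    if any(block is None for block in surviving):
        return None  # a second unreadable block: parity alone is not enough
    return xor_blocks(surviving)


# Hypothetical three-data-disk stripe, two bytes per block.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)

# Disk 1 is dead but everything else reads fine: the block comes back intact.
recovered = rebuild_block([data[0], data[2], parity])

# Disk 1 is dead and disk 2 returns a media error: nothing to rebuild from.
stuck = rebuild_block([data[0], None, parity])
```

Under that assumption, anything the rebuild did manage to write to the spare
should match what was on the failed drive, but any stripe that hits a media
error on the other drive stays unrecoverable from parity alone.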
I'm not sure if any of this helps.  Good luck and hopefully you'll get the
cash for a maintenance contract soon - they are invaluable.
-- Jeff
On Wed, Jan 02, 2002 at 07:31:41PM -0400, Dale Gass wrote:
...
Hey, got a couple of NetApp questions; they're a little low-level, and 
undoubtedly unsupported and warranty-voiding, etc., but any advice would 
be appreciated...  (We don't have a support contract, due to the costs.)
After moving a network appliance, it woke up kind of grumpy.  One of the 
drives (Seagate Cheetah 18G FC-AL) didn't respond at all.  Normally, 
when powering on a shelf, the green light goes off, blinks a 
bit, and then comes on solid.  With this drive, the light just stays on 
solid, from the instant the switch is flipped on for the shelf.  Pretty 
much non-responsive.
So the unit attempts a rebuild.  However, during the rebuild, 
unrecoverable media errors are encountered on another drive in the raid 
set.  Sigh...  "File system may be scrambled."
So...  With the netapp off, I try another drive in the non-responsive 
shelf's slot.  Its status light does the normal thing, so the shelf is 
okay, something is wrong with the drive or controller.  Given the fact 
the light doesn't blink at all upon initialization, I suspect the 
controller.
Since I had another spare drive, of exactly the same make/model, I tried 
swapping the controller card with the non-responsive unit.  Doing so, 
then powering on the shelf, gives the normal status light sequence 
(blinking, then solid).  A good sign so far.
Then...  Powering on the netapp: it says the raid set is still missing a 
drive, shows the drive with the new controller as "uninitialized", 
assigns it as a hot spare, and then tries the rebuild again (which 
fails on the media errors on the other drive...)
So.  I'm guessing the NetApp uses the drive's serial # (which is on the 
controller card, not the drive, I presume) to keep track of the drive's 
function.  I guess my three questions are as follows:

1) Is there any way to tell the NetApp that a drive's serial # has
changed?  (Where is this low-level raid configuration data stored?  In
NVRAM I assume?  I looked around the files in /vol/vol0/etc, but nothing
looked appropriate.)

2) Does the fact that the drive was flagged as a hot spare actually
cause anything to be written to the drive, or is it just noted as such
in the NetApp's configuration?  (Also, since a rebuild was attempted,
does that mean my data was overwritten?  I guess since it was a rebuild,
any data that was successfully rebuilt before the media errors should
be the same as was on the drive before, right?  Or not...?  The drive
# was 26.  There was another hot spare at #24; #26 is listed first in the
hot spare list on boot; would the lower-numbered or listed-first hot
spare tend to be used?  The filer didn't indicate *which* hot spare it
was starting to use in the rebuild.)

3) Each time that the filer starts up, it claims to recover different
blocks on the medium error, and claims they'll be reassigned.  Is there
any way to force the retries on this higher, so it will be more
determined at recovery on the medium errors?  Since different sectors
are successfully read different times, one would think that if it did
the reassignment each time, it would eventually recover the drive (maybe).
Dual failures really suck on raid.  I'm hoping that there's a way to 
bring this back to life.
(We do have the data on tape; but due to a number of circumstances I 
won't go into here, restoring it would be *very* laborious, so I'm 
hoping for a bit more of a creative solution :-)
Thanks, all...
-dale
