Hey, got a couple of NetApp questions; they're a little low-level, and
undoubtedly unsupported and warranty-voiding, etc., but any advice would
be appreciated... (We don't have a support contract, due to the costs.)
After we moved our network appliance, it woke up kind of grumpy. One of
the drives (a Seagate Cheetah 18G FC-AL) didn't respond at all.
Normally, when a shelf powers on, a drive's green light goes off, blinks
a bit, and then comes on solid. With this drive, the light just stays on
solid from the instant the shelf's power switch is flipped. Pretty much
non-responsive.
So the unit attempts a rebuild. During the rebuild, however, it hits
unrecoverable media errors on another drive in the RAID set. Sigh...
"File system may be scrambled."
So... With the NetApp off, I try another drive in the non-responsive
drive's slot. Its status light does the normal thing, so the shelf is
okay; something is wrong with the drive or its controller. Given that
the light doesn't blink at all at initialization, I suspect the
controller.
Since I had a spare drive of exactly the same make/model, I tried
swapping its controller card onto the non-responsive unit. Powering on
the shelf then gives the normal status-light sequence (blinking, then
solid). A good sign so far.
Then... Powering on the NetApp: it says the RAID set is still missing a
drive, shows the drive with the new controller as "uninitialized",
assigns it as a hot spare, and then tries the rebuild again (which fails
on the media errors on the other drive...).
So. I'm guessing the NetApp uses the drive's serial # (which lives on
the controller card, not the drive itself, I presume) to keep track of
the drive's role. My three questions are as follows:
1. Is there any way to tell the NetApp that a drive's serial # has
changed? (Where is this low-level RAID configuration data stored? In
NVRAM, I assume? I looked through the files in /vol/vol0/etc, but
nothing looked appropriate.)
2. Does flagging the drive as a hot spare actually cause anything to be
written to it, or is it just noted as such in the NetApp's
configuration? (Also, since a rebuild was attempted, does that mean my
data was overwritten? I suppose any data that was successfully rebuilt
before the media errors should be the same as was on the drive before,
right? Or not...? The drive # was 26. There was another hot spare at
#24; #26 is listed first in the hot spare list on boot. Would the
lower-numbered or the listed-first hot spare tend to be used? The filer
didn't indicate *which* hot spare it was starting to use in the
rebuild.)
3. Each time the filer starts up, it claims to recover different blocks
past the medium errors, and says they'll be reassigned. Is there any way
to force a higher retry count, so it's more persistent about recovering
from the medium errors? Since different sectors read successfully on
different attempts, one would think that if the reassignment stuck each
time, it would eventually recover the whole drive (maybe).
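To make the question-3 idea concrete, here's roughly what I'm imagining,
sketched in Python. This isn't anything the filer offers (as far as I
know); it assumes the drive could be attached to some host where
individual blocks can be read, and it just merges whatever reads succeed
across multiple passes, ddrescue-style:

```python
def multi_pass_recover(read_block, num_blocks, passes=5):
    """Try every block up to `passes` times; keep whatever succeeds.

    read_block(i) returns the bytes of block i, or raises IOError on a
    medium error.  Returns (recovered {block: data}, still-bad blocks).
    """
    recovered = {}
    for _ in range(passes):
        # Only retry blocks we haven't captured yet.
        missing = [i for i in range(num_blocks) if i not in recovered]
        if not missing:
            break
        for i in missing:
            try:
                recovered[i] = read_block(i)
            except IOError:
                pass  # leave it for the next pass
    bad = [i for i in range(num_blocks) if i not in recovered]
    return recovered, bad
```

With flaky sectors that read on some attempts but not others, each pass
shrinks the missing set; only hard errors survive every pass.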
Dual failures on RAID really suck. I'm hoping there's a way to bring
this back to life.
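For what it's worth, here's the mental model behind my question 1 as a
toy Python sketch. This is pure speculation about ONTAP internals, and
all the names are made up; it just shows why swapping the controller
card would make the same platters look like a brand-new drive, if the
config is keyed on the serial # the card reports:

```python
def classify(raid_config, reported_serial):
    """Role the filer would (hypothetically) assign to a drive it sees."""
    if reported_serial in raid_config:
        return raid_config[reported_serial]   # known member: keep its role
    return "uninitialized -> hot spare"       # unknown serial: treated as new

# Imagined RAID config: serial # (from the controller card) -> role.
raid_config = {"SEAGATE-001": "raid member, disk 26"}

# The original card reports the known serial, so the drive keeps its role;
# the swapped-in card reports the spare's serial, so the same platters
# now look like an uninitialized drive and get grabbed as a hot spare.
print(classify(raid_config, "SEAGATE-001"))
print(classify(raid_config, "SEAGATE-099"))
```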
(We do have the data on tape; but due to a number of circumstances I
won't go into here, restoring it would be *very* laborious, so I'm
hoping for a bit more of a creative solution :-)
Thanks, all...
-dale