Hey, got a couple of NetApp questions; they're a little low-level, and undoubtedly unsupported and warranty-voiding, etc., but any advice would be appreciated... (We don't have a support contract, due to the costs.)
After we moved our network appliance, it woke up kind of grumpy. One of the drives (a Seagate Cheetah 18G FC-AL) didn't respond at all. Normally, when a shelf is powered on, the green light goes off, blinks a bit, and then comes on solid. With this drive, the light just stays on solid from the instant the shelf's power switch is flipped. Pretty much non-responsive.
So the unit attempts a rebuild. However, during the rebuild, unrecoverable media errors are encountered on another drive in the raid set. Sigh... "File system may be scrambled."
So... With the netapp off, I tried another drive in the non-responsive drive's slot. Its status light did the normal thing, so the shelf is okay; something is wrong with the drive or its controller card. Given that the light doesn't blink at all on initialization, I suspect the controller.
Since I had another spare drive of exactly the same make/model, I tried swapping its controller card onto the non-responsive unit. Doing so and then powering on the shelf gives the normal status light sequence (blinking, then solid). A good sign so far.
Then... Powering on the netapp: it says the raid set is still missing a drive, shows the drive with the new controller as "uninitialized", assigns it as a hot spare, and then tries the rebuild again (which fails on the media errors on the other drive...).
So. I'm guessing the NetApp uses the drive's serial # (which is on the controller card, not the drive itself, I presume) to keep track of the drive's function. My three questions are as follows:
1. Is there any way to tell the NetApp that a drive's serial # has changed? (Where is this low-level raid configuration data stored? In NVRAM, I assume? I looked around the files in /vol/vol0/etc, but nothing looked appropriate.)
2. Does the fact that the drive was flagged as a hot spare actually cause anything to be written to the drive, or is it just noted as such in the NetApp's configuration? (Also, since a rebuild was attempted, does that mean my data was overwritten? I guess since it was a rebuild, any data that was successfully rebuilt before the media errors should be the same as was on the drive before, right? Or not...? The drive # was 26. There was another hot spare at #24; #26 is listed first in the hot spare list on boot. Would the lower-numbered or the listed-first hot spare tend to be used? The filer didn't indicate *which* hot spare it was starting to use in the rebuild.)
3. Each time the filer starts up, it claims to recover different blocks on the drive with the medium errors, and claims they'll be reassigned. Is there any way to force the retry count higher, so it will be more determined about recovering from the medium errors? Since different sectors are successfully read on different attempts, one would think that if it did the reassignment each time, it would eventually recover the whole drive (maybe).
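In case it helps to make the idea concrete, here's a minimal sketch of the multi-pass recovery I'm imagining (Python, run entirely outside ONTAP; the device path, sector size, and sector range are hypothetical placeholders):

    # Sketch of the multi-pass idea above - not an ONTAP feature, just the
    # reasoning: sectors that fail on one pass sometimes succeed on a later
    # one, so keep whatever reads cleanly and retry only the holes.
    # Device path, sector size, and sector range are hypothetical.
    import os

    DEV = "/dev/hypothetical_disk"   # placeholder device node
    SECTOR = 512
    FIRST, LAST = 0, 1000            # sector range to recover

    recovered = {}                   # sector number -> data
    for attempt in range(20):        # give up after 20 full passes
        missing = [s for s in range(FIRST, LAST) if s not in recovered]
        if not missing:
            break
        fd = os.open(DEV, os.O_RDONLY)
        try:
            for s in missing:
                try:
                    os.lseek(fd, s * SECTOR, os.SEEK_SET)
                    recovered[s] = os.read(fd, SECTOR)
                except OSError:
                    pass             # still unreadable; retry next pass
        finally:
            os.close(fd)

    print(f"recovered {len(recovered)} of {LAST - FIRST} sectors")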
Dual failures really suck on raid. I'm hoping that there's a way to bring this back to life.
(We do have the data on tape, but due to a number of circumstances I won't go into here, restoring it would be *very* laborious, so I'm hoping for a bit more of a creative solution :-)
Thanks, all...
-dale
Dale:
Sounds like you've had some rough problems. =(
It is too bad you don't have a support contract - we've found that this is the quickest way to get something resolved. "Doing it yourself" can often trigger undesirable circumstances, if you know what I mean. In fact, I wouldn't recommend buying a filer if the maintenance costs were not within budget.
That having been said, it looks like what you encountered was the "9/18GB Low-profile/Half-height spin-up issue", a.k.a. NetApp Bug #19845. There are extensive notes on the background and fix for this problem on the NOW site, but I'm guessing you don't have access to that either. If you do, the URL is:
http://now.netapp.com/Knowledgebase/solutionarea.asp?id=3.0.867965.2561976
Basically, a known manufacturing defect with those drives makes them prone to not spinning up after being turned off. NetApp recommends avoiding turning them off at all costs. Different ones will fail each time the whole set is turned off, and it is possible that many could be broken at once. Without maintenance, it's going to be difficult to get those vulnerable drives replaced. =(
The fix for this is to boot from ONTAP floppies release 5.3.4R3P1 (although I believe most later releases will also have this fix) and issue a special command. Once booted, go into maintenance mode (option 5) and run "disk_fw_update_fix". Our experience is that this won't always fix all affected drives.
Now your particular situation is more complicated because you've replaced the disk controller card. Maybe you could put the "bad" card back in and try the recovery procedure? I make no claim that this will work and it sounds scary to me, but I'm not sure what other options you have available at this point.
Do you have more than one spare? If so, maybe try pulling the one that is being rebuilt onto and let it pick a different disk for the rebuild. Again, this sounds scary, so proceed with caution.
In answer to your specific questions:
1) I don't know of any command to notify the NetApp about the drive's controller board changing - in fact, I'm relatively confident that it doesn't care about the serial number of that board. The contents of each disk are managed by a set of redundant disk labels that should stay on the drive, regardless of the controller board. Why that isn't working for you is a bit of a mystery to me, but then again, we've never altered the inside of any NetApp drives.
2) Nothing is written to a hot spare (besides a disk label saying "this is a hot spare") until the disk is added to a volume or grabbed for a rebuild. In your situation, that won't make a big difference, because if it has already called it a hot spare, it has probably labeled it as such. That could be fixed by booting from floppy and using the special disk label editor, but without NetApp support helping you along, or lots of experience with that command, I can't recommend that as a sane idea. As far as I know, ONTAP always picks the next available spare of the correct size. You should be able to tell which one is being rebuilt onto by examining "vol status -r" output for the volume being reconstructed.
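On the "was my data overwritten" worry: a reconstruction just recomputes each missing block from parity, so whatever the rebuild wrote before hitting the media errors should be bit-identical to what the dead disk held, assuming parity was consistent. Here's a toy sketch of the principle (single-parity XOR, as in RAID 4), emphatically not NetApp's actual code:

    # Toy illustration of single-parity reconstruction, the principle
    # behind a RAID 4 rebuild - not NetApp's actual code. Parity is the
    # XOR of the data blocks, so a missing block is the XOR of parity
    # with the surviving data blocks.

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        out = bytearray(len(blocks[0]))
        for blk in blocks:
            for i, b in enumerate(blk):
                out[i] ^= b
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]      # one stripe, three data disks
    parity = xor_blocks(data)               # what the parity disk stores

    lost = data[1]                          # pretend disk 1 died
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == lost                  # rebuild reproduces the original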
3) I'm unaware of a way to force the media retries to be higher. There is a special "ignore disk errors" command in the secret boot menu, but it could end up losing your data just as well. You don't want to try that without Tech Support verifying that it is a good idea based on your situation.
I'm not sure if any of this helps. Good luck and hopefully you'll get the cash for a maintenance contract soon - they are invaluable.
-- Jeff
Wow! You guys are incredibly helpful. This list is quite a find indeed :-)
Here's my progress so far (yes, I know this is all very scary stuff, and I stand a good chance of having to go to tape :-):
1. Tried putting the old controller card back in, to try disk_fw_update_fix on the spin-up problem with the old controller card. No dice. In order to apply the firmware update, it needs to at least see the drive, and with the old controller card, it's not even seeing the drive. It scans the other drives but doesn't apply the firmware, as it says the firmware patch is only for 118202FC drives, and I have all 118203FC. So I'm guessing there's no reviving that old controller card; it's not a spin-up issue, but a dead controller, period.
2. So, the next logical step was to put the new controller card back on the old drive, since that at least allows the drive to wake up and see the world. Now it's flagged as a hot spare, so the best course of action seems to be to modify the label back to the raid set member it used to be before I swapped controllers. The "secret menu" has options for listing all the labels. Doing so, I can clearly see that there is one obvious missing entry, which can easily be reconstructed. There is:
...
7.16 : RAID/1009983010/408/0 100437a/0/10/38/ RAID/1009983010/408/0 100437a/0/10/38/
7.17 : RAID/1009983010/408/0 100437a/0/12/38/ RAID/1009983010/408/0 100437a/0/12/38/
...
Everything from /0/0/38/ to /0/13/38/ is there for that raid group, but no /0/11/38/, so that's clearly the missing drive. So I want to relabel 7.26 as:
7.26 : RAID/1009983010/408/0 100437a/0/11/38/ RAID/1009983010/408/0 100437a/0/11/38/
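To double-check that /0/11/ really is the only hole, I threw together a quick parser for the printed listing - the field layout is just my own guess, and it runs on a pasted copy of the output, not on the filer:

    # Quick sanity check run against a pasted copy of the label listing.
    # The field layout (magic/time/gen/shutdown fsid/rgid/rgdn/total) is
    # my own guess at the printed format, nothing official. It parses
    # just the first of the two label copies on each line.
    import re

    LINE = re.compile(
        r"(?P<disk>\S+) : (?P<magic>\w+)/(?P<time>\d+)/(?P<gen>\d+)/(?P<shutdown>\d+) "
        r"(?P<fsid>\w+)/(?P<rgid>\d+)/(?P<rgdn>\d+)/(?P<total>\d+)/"
    )

    listing = """\
    7.16 : RAID/1009983010/408/0 100437a/0/10/38/ RAID/1009983010/408/0 100437a/0/10/38/
    7.17 : RAID/1009983010/408/0 100437a/0/12/38/ RAID/1009983010/408/0 100437a/0/12/38/
    """  # ...paste the full listing here

    seen = {}                        # rgdn -> disk name
    for line in listing.splitlines():
        m = LINE.match(line.strip())
        if m:
            seen[int(m.group("rgdn"))] = m.group("disk")

    nums = sorted(seen)
    missing = sorted(set(range(nums[0], nums[-1] + 1)) - set(nums))
    print("missing rgdn(s):", missing)   # prints [11], with the sample or the full listing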
In the menu system which lets you edit a disk label, you have to type in the names of the fields you want to modify. When printing the labels, the headers indicate magic/time/gen/shutdown and fsid/rgid/rgdn/total, which is what I'm using as field names to try and edit the label. I'm guessing they mean something like this:
Magic    - What type the disk is, RAID or SPARE
Time     - Time the raid group was created
Gen      - I don't know; some unique ID per raid group, I assume
Shutdown - Shutdown time?
Fsid     - File system ID
Rgid     - Raid group id
Rgdn     - Raid group disk number
Total    - Total disks in the raid group
So, I'm able to type in "magic" to set the magic, "time" to set the time, "fsid" to set the file system ID, and "rgdn" to set the raid group disk number properly. (I can't set "total", but I guess that's computed.) However, I can't seem to set the "gen"; "gen" isn't accepted as a field name. Without setting it, my disk label shows up as this:
7.26 : RAID/1009983010/0/0 100437a/0/11/1/ RAID/1009983010/0/0 100437a/0/11/1/
So it looks like it's in its own raid group (with a total of 1 disk).
Does anyone know how to set that third field (which is 408 in my case), known as "gen"? I think I'm close on this; I just need that one more bit of information :-)
If it doesn't join the raid group happily after getting this labelled right, or the raid doesn't rebuild, then my last course of action will be to pull that drive and use "ignore medium errors" on a raid rebuild. If the medium errors are kept to a minimum, I should have most of the data. If not, oh well, at least I tried :-) And it's on to the lengthy tape restore...
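For what it's worth, my mental model of what "ignore medium errors" would mean for the data - just my reading, not how ONTAP actually implements it - is that a stripe whose surviving blocks all read cleanly reconstructs exactly, and a medium error on any survivor loses just that stripe:

    # My reading of an "ignore medium errors" style rebuild - not ONTAP's
    # code. A stripe whose surviving blocks all read cleanly reconstructs
    # exactly; a medium error on any survivor loses only that stripe.

    def rebuild_block(survivor_blocks, parity_block):
        """Reconstruct one missing block; None marks an unreadable survivor."""
        acc = bytearray(parity_block)
        for blk in survivor_blocks:
            if blk is None:            # medium error on a surviving disk:
                return None            # this stripe is lost, but keep going
            for i, b in enumerate(blk):
                acc[i] ^= b
        return bytes(acc)

    # Parity b"@@@@" is the XOR of b"AAAA", b"BBBB", b"CCCC":
    print(rebuild_block([b"AAAA", b"CCCC"], b"@@@@"))  # b'BBBB' - recovered
    print(rebuild_block([b"AAAA", None], b"@@@@"))     # None - stripe lost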
Thanks for any help...
-dale
Sounds like you're getting pretty close, Dale!!
On Fri, Jan 04, 2002 at 03:30:09PM -0400, Dale Gass wrote:
> 1. Tried putting the old controller card back in, to try disk_fw_update_fix on the spin-up problem with the old controller card. No dice. In order to apply the firmware update, it needs to at least see the drive, and with the old controller card, it's not even seeing the drive. It scans the other drives but doesn't apply the firmware, as it says the firmware patch is only for 118202FC drives, and I have all 118203FC. So I'm guessing there's no reviving that old controller card; it's not a spin-up issue, but a dead controller, period.
Seems like a reasonable assumption; let's go with it!
> which is what I'm using as field names to try and edit the label. I'm guessing they mean something like this:
Doing this from memory and I'm a bit foggy:
> Magic    - What type the disk is, RAID or SPARE
I think there is also BROKEN, but you won't need that.
> Time     - Time the raid group was created
This is some codification of the last time the disk labels got updated.
> Gen      - I don't know; some unique ID per raid group, I assume
This is some sort of generation counter that gets incremented - also on label updates, I believe.
> Shutdown - Shutdown time?
Actually a flag for whether the volume was shut down cleanly.
> Fsid     - File system ID
Yup - distinguishes one volume from another.
> Rgid     - Raid group id
Correct again; be aware you can have multiple raid groups in one volume, so read the fsid and rgid fields closely.
> Rgdn     - Raid group disk number
Yes, and disk number zero should be the parity disk for that RAID group.
> Total    - Total disks in the raid group
Right on the nose. Be aware there are two labels on every disk. Each set of labels is updated one at a time, so that if the filer crashes, you should have at least one good set.
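The usual trick behind that kind of two-copy scheme - and this is the generic technique, not a claim about NetApp's actual on-disk format - is to overwrite the older copy first and stamp each copy with the generation counter, so on boot you trust the newest copy that reads back intact:

    # Generic sketch of a two-copy label scheme: the crash-safety
    # technique in general, NOT NetApp's actual on-disk format. Copies
    # are written one at a time; each carries a generation number and a
    # checksum, and the reader trusts the newest copy that validates.
    import json, zlib

    labels = [None, None]              # the two on-disk label slots

    def slot_gen(i):
        """Generation stored in a slot, or -1 if the slot is empty."""
        return -1 if labels[i] is None else json.loads(labels[i][1])["gen"]

    def write_label(fields):
        """Bump gen and overwrite the OLDER slot, so one valid copy survives."""
        fields = dict(fields, gen=fields["gen"] + 1)
        blob = json.dumps(fields).encode()
        slot = min((0, 1), key=slot_gen)
        labels[slot] = (zlib.crc32(blob), blob)   # a crash mid-write here
        return fields                             # still leaves the other copy

    def read_label():
        """Return the highest-generation copy whose checksum validates."""
        good = []
        for rec in labels:
            if rec is not None:
                crc, blob = rec
                if zlib.crc32(blob) == crc:
                    good.append(json.loads(blob))
        return max(good, key=lambda f: f["gen"]) if good else None

    f = {"magic": "RAID", "gen": 407, "rgdn": 11, "total": 38}
    f = write_label(f)                 # gen 408 lands in slot 0
    f = write_label(f)                 # gen 409 lands in slot 1
    print(read_label()["gen"])         # 409 - newest valid copy wins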
> So, I'm able to type in "magic" to set the magic, "time" to set the time, "fsid" to set the file system ID, and "rgdn" to set the raid group disk number properly. (I can't set "total", but I guess that's computed.) However, I can't seem to set the "gen"; "gen" isn't accepted as a field name. Without setting it, my disk label shows up as this:
> 7.26 : RAID/1009983010/0/0 100437a/0/11/1/ RAID/1009983010/0/0 100437a/0/11/1/
> So it looks like it's in its own raid group (with a total of 1 disk).
> Does anyone know how to set that third field (which is 408 in my case), known as "gen"? I think I'm close on this; I just need that one more bit of information :-)
It has been a while since I've run a label edit. Try "generation" as a field name to edit the gen.
> If it doesn't join the raid group happily after getting this labelled right, or the raid doesn't rebuild, then my last course of action will be to pull that drive and use "ignore medium errors" on a raid rebuild. If the medium errors are kept to a minimum, I should have most of the data. If not, oh well, at least I tried :-) And it's on to the lengthy tape restore...
Good luck!
-- Jeff
Jeffrey Krueger wrote:
> Sounds like you're getting pretty close, Dale!!
I think so; the point of success or failure is pretty close now, one way or another :-) Only one more issue, see below:
> On Fri, Jan 04, 2002 at 03:30:09PM -0400, Dale Gass wrote:
> > 1. Tried putting the old controller card back in, to try disk_fw_update_fix on the spin-up problem with the old controller card. No dice. In order to apply the firmware update, it needs to at least see the drive, and with the old controller card, it's not even seeing the drive. It scans the other drives but doesn't apply the firmware, as it says the firmware patch is only for 118202FC drives, and I have all 118203FC. So I'm guessing there's no reviving that old controller card; it's not a spin-up issue, but a dead controller, period.
> Seems like a reasonable assumption; let's go with it!
> > which is what I'm using as field names to try and edit the label. I'm guessing they mean something like this:
> Doing this from memory and I'm a bit foggy:
> > Magic    - What type the disk is, RAID or SPARE
> I think there is also BROKEN, but you won't need that.
> > Time     - Time the raid group was created
> This is some codification of the last time the disk labels got updated.
> > Gen      - I don't know; some unique ID per raid group, I assume
> This is some sort of generation counter that gets incremented - also on label updates, I believe.
Cool. "Generation" was indeed the keyword required to change it. So that's taken care of.
> > Shutdown - Shutdown time?
> Actually a flag for whether the volume was shut down cleanly.
> > Fsid     - File system ID
> Yup - distinguishes one volume from another.
> > Rgid     - Raid group id
> Correct again; be aware you can have multiple raid groups in one volume, so read the fsid and rgid fields closely.
> > Rgdn     - Raid group disk number
> Yes, and disk number zero should be the parity disk for that RAID group.
> > Total    - Total disks in the raid group
> Right on the nose. Be aware there are two labels on every disk. Each set of labels is updated one at a time, so that if the filer crashes, you should have at least one good set.
After setting the "generation" properly, I thought the "total" field would be updated correctly, but it is not. My labels for the disk in question, and the few around it, are as follows:
7.27 : RAID/1009983010/408/0 100437a/0/9/38/ RAID/1009983010/408/0 100437a/0/9/38/
7.16 : RAID/1009983010/408/0 100437a/0/10/38/ RAID/1009983010/408/0 100437a/0/10/38/
7.26 : RAID/1009983010/408/0 100437a/0/11/1/ RAID/1009983010/408/0 100437a/0/11/1/
7.17 : RAID/1009983010/408/0 100437a/0/12/38/ RAID/1009983010/408/0 100437a/0/12/38/
7.21 : RAID/1009983010/408/0 100437a/0/13/38/ RAID/1009983010/408/0 100437a/0/13/38/
So everything looks fine for my 7.26 drive that I'm trying to get to rejoin the group, *except* for the "total" field, which is /1/ instead of /38/ like everyone else's. I'm not going to attempt the rebuild until someone can help me figure out how to change that, or confirm that it's not necessary :-)
"total" doesn't work as a keyword in the label editor.
Anyone know?
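For the record, here's the comparison I'm doing by eye, written out (values transcribed from the listing above; same home-grown assumptions as my earlier parsing sketch):

    # The by-eye comparison above, written out. Values transcribed from
    # the listing; the shared fields (fsid, rgid, total) should agree
    # across every member of the raid group.
    from collections import Counter

    rows = [
        # disk,  fsid,     rgid, rgdn, total
        ("7.27", "100437a", 0,  9, 38),
        ("7.16", "100437a", 0, 10, 38),
        ("7.26", "100437a", 0, 11,  1),   # the relabeled drive
        ("7.17", "100437a", 0, 12, 38),
        ("7.21", "100437a", 0, 13, 38),
    ]

    for field, idx in (("fsid", 1), ("rgid", 2), ("total", 4)):
        majority, _ = Counter(r[idx] for r in rows).most_common(1)[0]
        for r in rows:
            if r[idx] != majority:
                print(f"{r[0]}: {field}={r[idx]} (group says {majority})")
    # -> 7.26: total=1 (group says 38)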
> > So, I'm able to type in "magic" to set the magic, "time" to set the time, "fsid" to set the file system ID, and "rgdn" to set the raid group disk number properly. (I can't set "total", but I guess that's computed.) However, I can't seem to set the "gen"; "gen" isn't accepted as a field name. Without setting it, my disk label shows up as this:
> > 7.26 : RAID/1009983010/0/0 100437a/0/11/1/ RAID/1009983010/0/0 100437a/0/11/1/
> > So it looks like it's in its own raid group (with a total of 1 disk).
> > Does anyone know how to set that third field (which is 408 in my case), known as "gen"? I think I'm close on this; I just need that one more bit of information :-)
> It has been a while since I've run a label edit. Try "generation" as a field name to edit the gen.
Yes, that indeed worked! :-)
> > If it doesn't join the raid group happily after getting this labelled right, or the raid doesn't rebuild, then my last course of action will be to pull that drive and use "ignore medium errors" on a raid rebuild. If the medium errors are kept to a minimum, I should have most of the data. If not, oh well, at least I tried :-) And it's on to the lengthy tape restore...
> Good luck!
Thanks. I'll let folks know how I finally make out with this adventure :-)
-dale