Folks, I've just been through a "replace a disk" scenario that went wrong. It *isn't* a "disaster", because the machine wasn't being used for anything interesting at the time. I am recounting it here in case others can benefit (and so you can all smirk at my expense :-).
Old NetApp F220. One disk died. Called support guys; they sent another disk. Typed 'disk swap', swapped a disk, typed 'disk unswap'.
Hmmm. One thing wrong, another possibly. I'd swapped the wrong disk (counted them from the wrong end), and I'd possibly taken too long to complete the swap. All I know is, the filer went down in a blaze of glory.
When I rebooted, it failed to load the OS; error "Invalid opcode" (i.e. it read junk off the disk).
So I rebooted from floppies (after swapping the disks back around correctly), and that seemed cool -- it figured out which disk was what, and did all the necessary RAID reconstruction. Everything looked OK.
Rebooted. Failed to load the OS: 'Invalid opcode' again. Hmmm... At this point, I said, "Never mind" and just rebuilt the whole thing from scratch.
What I *think* I could've done was: reboot from floppies; splat the ontap stuff into /vol/vol0/etc from afar; type 'download' at the filer, and it might've worked. (But would I have had reason to trust any of the ordinary RAID data at this point?)
If you spot a place where I (+ support guy) went badly wrong, other than what I've outlined, I'd like to know. Thx,
Will
From Will Partain on Fri, 18 Feb 2000 18:46:53 GMT:
> Old NetApp F220. One disk died. Called support guys; they sent another disk. Typed 'disk swap', swapped a disk, typed 'disk unswap'.
This isn't the correct procedure. To swap a disk, type "disk swap". You may then remove a *single* disk. Wait at least 30 seconds for the disk unit status check to complete, or until you see confirmation of that in the /etc/messages file. Now type "disk swap" again. You may now insert a single disk.
You only use "disk unswap" when you have previously issued a "disk swap" command but have then decided not to add or remove a disk from the system. The "disk unswap" command allows the SCSI bus to resume communications.
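As an illustration only (this is not NetApp tooling), the sequence described above can be written down as a small checklist validator. The function name `check_swap_log` and the step labels "remove disk" and "insert disk" are hypothetical; only the ordering of the "disk swap" commands, the 30-second wait, and the role of "disk unswap" come from the message. The sketch models the timed wait only, not the alternative of watching /etc/messages for confirmation.

```python
# Hypothetical sketch: checks an operator's log of steps against the
# hot-swap sequence described in the message above. It does not talk
# to a filer; the step labels are made up for illustration.
MIN_WAIT_SECONDS = 30  # "wait at least 30 seconds" after removing the disk

EXPECTED = ["disk swap", "remove disk", "disk swap", "insert disk"]


def check_swap_log(steps):
    """steps: list of (seconds_since_start, action) tuples, in order.

    Returns a list of problems; an empty list means the log matches.
    """
    problems = []
    actions = [action for _, action in steps]
    if "disk unswap" in actions:
        problems.append('"disk unswap" cancels a swap; it does not finish one')
    # Compare everything except "disk unswap" against the expected order.
    core = [(t, a) for t, a in steps if a != "disk unswap"]
    if [a for _, a in core] != EXPECTED:
        problems.append("expected sequence: " + " -> ".join(EXPECTED))
        return problems
    removed_at = core[1][0]
    second_swap_at = core[2][0]
    if second_swap_at - removed_at < MIN_WAIT_SECONDS:
        problems.append("waited less than 30 seconds after removing the disk")
    return problems


# The sequence from the original post: one 'disk swap', a physical swap,
# then 'disk unswap'.
original_post = [(0, "disk swap"), (10, "remove disk"),
                 (20, "insert disk"), (30, "disk unswap")]
for problem in check_swap_log(original_post):
    print(problem)
```

Run against the sequence from the original post, this reports both the misuse of "disk unswap" and the missing second "disk swap".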
> When I rebooted, it failed to load the OS; error "Invalid opcode" (i.e. it read junk off the disk).
I'm not familiar with this particular boot-time gotcha, but it sounds consistent with getting the disks mixed up and possibly not issuing both of the required "disk swap" commands.
> So I rebooted from floppies (after swapping the disks back around correctly), and that seemed cool -- it figured out which disk was what, and did all the necessary RAID reconstruction. Everything looked OK.
Booting from floppies is a good idea at that point.
> What I *think* I could've done was: reboot from floppies; splat the ontap stuff into /vol/vol0/etc from afar; type 'download' at the filer, and it might've worked. (But would I have had reason to trust any of the ordinary RAID data at this point?)
Ehe. If you can't boot the kernel off the SCSI disks, but you can off floppies then you have ...
1) ... a signaling problem on the SCSI bus - could be a bad cable, host adapter, shelf, or disk.
2) ... a horked set of boot blocks on your SCSI disks
In this situation, just running download from the console will update the boot block image on your disks from the currently installed OS in /etc.
If you can boot the kernel off SCSI disks, but you can't load your root volume then you've got (it will PANIC and scream this at you) an inconsistent volume - you're missing more than one disk.
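The triage above can be sketched as a small decision function. This is a rough illustration under stated assumptions, not a diagnostic tool: the function name `diagnose_boot` and its three boolean inputs are hypothetical, and the returned strings simply restate the cases from the message.

```python
# Hypothetical sketch of the boot triage described above. The inputs are
# observations made by the operator; nothing here queries a filer.
def diagnose_boot(boots_from_scsi, boots_from_floppy, root_volume_loads):
    """Return the list of likely causes for the observed boot behaviour."""
    if not boots_from_scsi and boots_from_floppy:
        # Kernel loads from floppies but not from the SCSI disks.
        return [
            "signaling problem on the SCSI bus "
            "(bad cable, host adapter, shelf, or disk)",
            "bad boot blocks on the SCSI disks "
            "(running 'download' from the console rewrites them from /etc)",
        ]
    if boots_from_scsi and not root_volume_loads:
        # Kernel loads, but the root volume cannot be brought up.
        return ["inconsistent volume: more than one disk is missing"]
    # Any other combination is not covered by the message.
    return []


# The situation in the original post: floppies boot, SCSI disks do not.
for cause in diagnose_boot(boots_from_scsi=False,
                           boots_from_floppy=True,
                           root_volume_loads=True):
    print(cause)
```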
> If you spot a place where I (+ support guy) went badly wrong, other than what I've outlined, I'd like to know.
Depending on where you called support, they should have been able to walk you through the disk swapping procedure. It's not brain surgery, but not following the instructions explicitly can lead to a "disaster". Unfortunately, you just learned that the hard way. =(
As a tip - label the hell out of your shelves. You never know when the amber failure light won't light up on the problem disk or if it's a 3AM swap when your brain just isn't in gear. Our filers are labeled to the extreme and probably ISO 9001 compliant. =) It's tedious, but we've found that eliminating the easy mistakes is worth the effort.
Good luck!
-- Jeff
--
----------------------------------------------------------------------------
Jeff Krueger                    E-Mail: jeff@qualcomm.com
NetApp File Server Lead         Phone: 858-651-6709
IT Engineering and Support      Fax: 858-651-6627
QUALCOMM, Incorporated          Web: www.qualcomm.com
----- Original Message -----
From: Jeff Krueger <jkrueger@qualcomm.com>
To: Will Partain <partain@mekb2.sps.mot.com>
Cc: toasters@mathworks.com
Sent: Friday, February 18, 2000 1:37 PM
Subject: Re: NetApp disk replacement "disaster": post-mortem
> From Will Partain on Fri, 18 Feb 2000 18:46:53 GMT:
> > Old NetApp F220. One disk died. Called support guys; they sent another disk. Typed 'disk swap', swapped a disk, typed 'disk unswap'.
> This isn't the correct procedure. To swap a disk, type "disk swap". You may then remove a *single* disk. Wait at least 30 seconds for the disk unit status check to complete, or until you see confirmation of that in the /etc/messages file. Now type "disk swap" again. You may now insert a single disk.
I would just like to say that this is a fairly ultra-safe way to do it, and you should be able to get away with just one "disk swap" command, then pull the bad disk and plug the new one in. At least, that used to work fine. What does the manual say? Moreover, why didn't you follow what the manual says, Will?
> > When I rebooted, it failed to load the OS; error "Invalid opcode" (i.e. it read junk off the disk).
> I'm not familiar with this particular boot-time gotcha, but it sounds consistent with getting the disks mixed up and possibly not issuing both of the required "disk swap" commands.
I agree, although I suspect it was just trying to read off the new disk. If he pulled the disk he just put in, I bet there is a good chance the filer would have booted fine.
> > So I rebooted from floppies (after swapping the disks back around correctly), and that seemed cool -- it figured out which disk was what, and did all the necessary RAID reconstruction. Everything looked OK.
> Booting from floppies is a good idea at that point.
Absolutely. At that point, he did the right thing about checking the disks and making sure reconstruction was back on track, but he never typed "download" again. (Although I'm not sure why this wasn't done as part of the reconstruction process.)
> Depending on where you called support, they should have been able to walk you through the disk swapping procedure. It's not brain surgery, but not following the instructions explicitly can lead to a "disaster". Unfortunately, you just learned that the hard way. =(
Yeah, I have to wonder if you got a bad support person, or if you did not actually tell them exactly what happened. He mentioned that at one point he said "never mind"; if he meant that literally, he didn't give the support person time to tell him to type download and everything would be fine again.
Bruce
From "Bruce Sterling Woodcock" on Fri, 18 Feb 2000 14:04:58 PST:
> fine. What does the manual say? Moreover, why didn't you follow what the manual says, Will?
Ahhh RTFM. Gotta love it. =)
> > > When I rebooted, it failed to load the OS; error "Invalid opcode" (i.e. it read junk off the disk).
> > I'm not familiar with this particular boot-time gotcha, but it sounds consistent with getting the disks mixed up and possibly not issuing both of the required "disk swap" commands.
> I agree, although I suspect it was just trying to read off the new disk. If he pulled the disk he just put in, I bet there is a good chance the filer would have booted fine.
This seems consistent with the boot blocks not getting downloaded to the new disk. It seems odd that all subsequent boots just happened to be from that new disk, but it's hard to shake the feeling that it was a missing or incorrect set of boot blocks.
Ah well, now that the volume has been scratched, we may never know. =)
-- Jeff
Actually, if you had already booted from floppy, simply running "download" at the filer prompt would have been sufficient. I have seen this exact same problem a couple of times and that was the fix.
Another alternative that usually works is to shut the system off and swap the first two disk positions, i.e. 0 & 1, which are on the opposite side from the power supply. This presumes the system is booting from disk 0. This lets the system read the same kernel from a different disk. Of course, you do not want to swap if it is a spare; in that case just pick any other non-spare disk and swap it with 0 (again, if this is the boot disk).
--tmac
> > So I rebooted from floppies (after swapping the disks back around correctly), and that seemed cool -- it figured out which disk was what, and did all the necessary RAID reconstruction. Everything looked OK.
> Booting from floppies is a good idea at that point.
> > What I *think* I could've done was: reboot from floppies; splat the ontap stuff into /vol/vol0/etc from afar; type 'download' at the filer, and it might've worked. (But would I have had reason to trust any of the ordinary RAID data at this point?)
On Fri, Feb 18, 2000 at 01:37:10PM -0800, Jeff Krueger wrote:
> From Will Partain on Fri, 18 Feb 2000 18:46:53 GMT:
> I'm not familiar with this particular boot-time gotcha, but it sounds consistent with getting the disks mixed up and possibly not issuing both of the required "disk swap" commands.
> Ehe. If you can't boot the kernel off the SCSI disks, but you can off floppies then you have ...
> 2) ... a horked set of boot blocks on your SCSI disks
2) above is pretty common in 5.1.x releases, i see it probably about once every 5 months
> In this situation, just running download from the console will update the boot block image on your disks from the currently installed OS in /etc.
> As a tip - label the hell out of your shelves. You never know when the amber failure light won't light up on the problem disk or if it's a 3AM swap when your brain just isn't in gear. Our filers are labeled to the extreme and probably ISO 9001 compliant. =) It's tedious, but we've found that eliminating the easy mistakes is worth the effort.
i am a visual sort of person, i like to see the red lights. take a look at the disks, if the bad disk LED isn't on, turn it on if you can. i'm not at a filer right now, but IIRC led_on will turn it on. if the disk isn't recognized by the system ( failed startup might show up in sysconfig -d ) you can turn on the 2 LEDs on either side of the failed drive, and pull the middle one. then remember to turn the LEDs off again.
-s