Had something similar happen to us. One of our Netapps failed a drive and reconstructed its data onto a different drive (from the hot spare pool). The failed drive should have been replaced ASAP, but it fell through the cracks and was forgotten.
A week later we shut it down to replace an FDDI card, and afterwards it wouldn't boot up. The failed drive was apparently working just well enough that the Netapp saw a RAID drive that wasn't a valid member of the array (inconsistent disk labels). Once we removed the problem drive, the Netapp booted just fine.
I don't understand why your bad drive was added to the hot spare pool upon reboot. It should have had a label that was inconsistent with the other drives, and the Netapp shouldn't have booted.
Could the Netapp somehow mark a bad drive so that the information is kept across boots?
If a failed drive is still working after a reboot, then its label should be inconsistent with the other drives in the array, and the Netapp shouldn't boot.
NOTE: For those wondering what label I am talking about, here is an excerpt from the System Administrator's Guide chapter on Troubleshooting:
The Netapp writes a label on each disk indicating its position in the RAID disk array.
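To make that concrete, here is a toy sketch of what such a label and the boot-time consistency check might look like. The fields, names, and numbers are invented for illustration only; this is not the actual on-disk format.

    /*
     * Conceptual sketch only -- not the real on-disk label format.
     * The fields are invented to illustrate a per-disk RAID label and
     * a boot-time consistency check across the whole array.
     */
    #include <stdio.h>

    struct raid_label {
        unsigned int array_id;    /* which RAID array the disk belongs to */
        unsigned int position;    /* slot within that array               */
        unsigned int generation;  /* bumped whenever the array changes    */
    };

    /* Return 1 if all labels name the same array and generation and each
     * disk claims a unique, valid position; 0 otherwise. */
    static int labels_consistent(const struct raid_label *lbl, int ndisks)
    {
        int seen[64] = { 0 };

        if (ndisks < 1 || ndisks > 64)
            return 0;
        for (int i = 0; i < ndisks; i++) {
            if (lbl[i].array_id   != lbl[0].array_id ||
                lbl[i].generation != lbl[0].generation)
                return 0;             /* disk from another/older array  */
            if (lbl[i].position >= (unsigned int)ndisks || seen[lbl[i].position]++)
                return 0;             /* missing or duplicated position */
        }
        return 1;
    }

    int main(void)
    {
        /* Disk 1 carries a stale label (older generation), much like a
         * failed-but-still-spinning drive that missed a reconstruction. */
        struct raid_label lbl[3] = { { 7, 0, 42 }, { 7, 1, 41 }, { 7, 2, 42 } };

        printf("labels consistent: %d\n", labels_consistent(lbl, 3));
        return 0;
    }

A drive whose label disagrees with the rest of the array is exactly what the filer complained about in our case.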
regards,
Steve Gremban  <gremban@ti.com>
PS: I noticed that you are running without a hot spare configured. We normally configure a spare in order to minimize the time spent at risk in degraded mode.
Do you consider the risk minimal that another drive will fail before the first one is replaced and rebuilt, or do you have some plan in place to ensure that someone is notified immediately? What about early in the morning, on weekends, or on holidays?
Anyone else out there running without hot spares?
------------- Begin Forwarded Message -------------
From madhatta@turing.mathworks.com Mon Jul 7 15:40:34 1997
Date: Mon, 7 Jul 1997 16:34:26 -0400 (EDT)
From: Brian Tao <taob@nbc.netcom.ca>
To: toasters@mathworks.com
Subject: Marking failed drives across boots?
MIME-Version: 1.0
We had a problem with our Netapp this morning that could potentially be quite serious. One drive near the beginning of the chain (ID 2, I believe) was failed out by the Netapp. Very shortly thereafter, the filer crashed with a RAID panic and rebooted. Upon rebooting, it noticed that drive ID 2 was not actively being used, and proceeded to add it to the hot spare pool. Then it began reconstructing the data onto (you guessed it) drive ID 2.
In this scenario, there was no time to pull out the bad drive, and the Netapp happily rebuilt the data on it. I guess the correct procedure now is to forcibly fail that drive, rebuild onto our good spare drive, and then remove drive ID 2. Could the Netapp somehow mark a bad drive so that the information is kept across boots?
> A week later we shut it down to replace an FDDI card, and afterwards it
> wouldn't boot up. The failed drive was apparently working just well enough
> that the Netapp saw a RAID drive that wasn't a valid member of the array
> (inconsistent disk labels). Once we removed the problem drive, the Netapp
> booted just fine.
I'd have to say the above is not "normal" behavior. But from what I've seen, what makes the difference is how a drive fails.
Normally, if the drive fails and reconstruction proceeds normally, the drive will look bad to the system and be unusable. This gives you time to swap it out for a new one.
If the system reboots and the drive has failed badly enough, it will fail initialization on boot (the console will say disk so-and-so is broken), so it won't appear to the system as a spare. The system will boot fine (other than the fact that it's probably in degraded mode, since it lost a disk). I think this is pretty much how it's "supposed" to work.
Often, though, if the drive suffered only a minor failure, it will look fine upon reboot and will get marked as a spare. This is known bad behavior that should be fixed.
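In rough code terms, the boot-time decision above looks something like the sketch below. The structure and names are made up for illustration; this is not Data ONTAP source.

    /*
     * Sketch of the boot-time classification described above.  Not real
     * filer code; the struct and names are invented for illustration.
     */
    #include <stdio.h>

    enum disk_state { DISK_BROKEN, DISK_ARRAY_MEMBER, DISK_SPARE };

    struct disk {
        int responds;       /* passes initialization at boot                */
        int label_valid;    /* carries a readable RAID label                */
        int label_matches;  /* that label agrees with the rest of the array */
    };

    static enum disk_state classify_at_boot(const struct disk *d)
    {
        if (!d->responds)
            return DISK_BROKEN;        /* "disk so-and-so is broken"      */
        if (d->label_valid && d->label_matches)
            return DISK_ARRAY_MEMBER;  /* rejoins the array               */

        /* A drive that failed only "softly" still responds, so it falls
         * through to here and gets treated as a spare -- the problem case. */
        return DISK_SPARE;
    }

    int main(void)
    {
        /* Alive, but its label no longer marks it as an active member. */
        struct disk softly_failed = { 1, 1, 0 };

        printf("state: %d (2 == spare)\n", classify_at_boot(&softly_failed));
        return 0;
    }

The trouble is that nothing on the drive itself records that it was ever failed, so a softly-failed disk ends up in that last branch.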
If the system fails in an unusual way, you can get the "inconsistent label" problem or something similar. I've usually only seen this when the system crashes immediately after it tries to fail a drive, while it was in the process of switching over to reconstruction or degraded mode. My understanding as a customer is that incidents like this are also bugs that should be fixed.
Also, there are times when a SCSI problem can make a drive look bad; the system attempts to fail the drive but winds up rebooting shortly thereafter due to bus problems, and the drive comes back fine. Although it "failed", the system never got far enough to actually fail the drive, and WAFL replay succeeds, so no data is lost.
Clearly, one issue here is making sure that when a drive fails, it's actually marked as BAD in some way. However, if the drive has failed in a sufficiently spectacular way, writing a "BAD" label onto it may be impossible. Despite the online description of bug 961, this issue is included within that bug, along with proactively failing a drive that looks like it "needs" to be replaced, given the frequency of errors.
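One way to picture that persistent marking is the sketch below. Everything here is hypothetical -- invented names and a stubbed-out label write, not the filer's real label format or API.

    /*
     * Sketch of the "mark it BAD persistently" idea.  All names and the
     * label layout are invented for illustration.
     */
    #include <stdio.h>

    #define LABEL_FLAG_FAILED  0x1u   /* "this disk was failed; do not reuse" */

    struct raid_label {
        unsigned int flags;
        /* ... position, array id, generation, etc. ... */
    };

    /* Stand-in for writing the label block back to the drive; a real
     * implementation would issue the write and report its status. */
    static int write_label(int disk_id, const struct raid_label *lbl)
    {
        printf("writing label (flags=0x%x) to disk %d\n", lbl->flags, disk_id);
        return 0;   /* pretend the write succeeded */
    }

    /* Record the failure on the disk itself so it survives a reboot. */
    static int mark_disk_failed(int disk_id, struct raid_label *lbl)
    {
        lbl->flags |= LABEL_FLAG_FAILED;
        return write_label(disk_id, lbl);
    }

    int main(void)
    {
        struct raid_label lbl = { 0 };
        return mark_disk_failed(2, &lbl) == 0 ? 0 : 1;
    }

If the drive is too far gone to accept even that write, the failure would have to be remembered somewhere else (the surviving disks' labels, NVRAM, and so on), which is presumably part of what makes this hard.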
Bruce