Had something similar happen to us. One of our Netapps failed a drive and reconstructed its data onto a different drive from the hot spare pool. The failed drive should have been replaced ASAP, but it fell through the cracks and was forgotten.
A week later we shut it down to replace an FDDI card, and afterwards it wouldn't boot. The failed drive was apparently working just well enough that the Netapp thought it had a RAID drive that wasn't a valid member of the array (inconsistent disk labels). Once we removed the problem drive, the Netapp booted just fine.
I don't understand why your bad drive was added to the hot spare pool upon reboot. It should have had a label inconsistent with those of the other drives, and the Netapp shouldn't have booted.
> Could the Netapp somehow mark a bad drive so that the information is kept across boots?
If a failed drive is working again after a reboot, its label should still be inconsistent with the labels on the other drives in the array, and the Netapp shouldn't boot.
NOTE: For those wondering what label I am talking about, here is an excerpt from the System Administrator's Guide chapter on Troubleshooting:
    "The Netapp writes a label on each disk indicating its position in the RAID disk array."
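To make the check concrete, here is a rough sketch of the idea; the field names and logic are my guesses at the behavior, not NetApp's actual on-disk label format:

    # Sketch of a boot-time RAID label consistency check.
    # Field names (array_id, position, generation) are assumptions.
    from dataclasses import dataclass

    @dataclass
    class DiskLabel:
        array_id: str     # which RAID array the disk belongs to
        position: int     # slot within that array
        generation: int   # bumped whenever the array config changes

    def check_labels(labels):
        """Split disks into current members and suspects whose labels
        disagree with the newest array generation."""
        newest = max(l.generation for l in labels)
        members  = [l for l in labels if l.generation == newest]
        suspects = [l for l in labels if l.generation != newest]
        return members, suspects

A drive that was failed out still carries the old generation after a reboot, so it should surface as a suspect rather than be quietly re-adopted; refusing to boot in that case is the conservative reaction.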
regards,
Steve Gremban  <gremban@ti.com>
PS: I noticed that you are running without a hot spare configured. We normally configure a spare in order to minimize time spent at risk in degraded mode.
Do you consider the risk minimal that another drive will fail before the first one is replaced and rebuilt, or do you have some plan in place to ensure that someone is notified immediately? What about early in the morning, on weekends, or on holidays?
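For a rough sense of the exposure, assume independent drives with a constant failure rate (both idealizations; the MTBF and window below are invented numbers, not measurements):

    # Back-of-the-envelope odds of a second failure while degraded.
    import math

    mtbf_hours = 500_000      # assumed per-drive MTBF
    window_hours = 72         # e.g. fails Friday night, fixed Monday
    surviving_drives = 6      # remaining members of the RAID group

    # Exponential model: P(at least one of n drives fails in t hours)
    #   = 1 - exp(-n * t / MTBF)
    p = 1 - math.exp(-surviving_drives * window_hours / mtbf_hours)
    print(f"P(second failure in window) ~ {p:.4%}")   # ~0.0864%

Small per incident, but it compounds over many incidents, and the window gets much longer when the failed drive is simply forgotten, as ours was.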
Anyone else out there running without hot spares?
------------- Begin Forwarded Message -------------
From madhatta@turing.mathworks.com Mon Jul 7 15:40:34 1997
Date: Mon, 7 Jul 1997 16:34:26 -0400 (EDT)
From: Brian Tao <taob@nbc.netcom.ca>
To: toasters@mathworks.com
Subject: Marking failed drives across boots?
MIME-Version: 1.0
We had a problem with our Netapp this morning that could potentially be quite serious. One drive near the beginning of the chain (ID 2, I believe) was failed out by the Netapp. Very shortly thereafter, the filer crashed with a RAID panic and rebooted. Upon rebooting, it noticed that drive ID 2 was not actively in use and proceeded to add it to the hot spare pool. Then it began reconstructing the data onto (you guessed it) drive ID 2.
In this scenario, there was no time to pull the bad drive, and the Netapp happily rebuilt the data on it. I guess the correct procedure now is to forcibly fail that drive, rebuild onto our good spare, and then remove drive ID 2. Could the Netapp somehow mark a bad drive so that the information is kept across boots?

------------- End Forwarded Message -------------
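For what it's worth, the cross-boot marking Brian asks about could be as simple as a small on-disk registry of failed-drive serial numbers, consulted before an unclaimed disk is adopted as a spare. This is a purely hypothetical sketch; the path, format, and function names are invented and say nothing about how Data ONTAP actually works:

    # Hypothetical persistent "failed disk" registry, keyed by serial
    # number so the mark survives reboots.  Path and format are invented.
    import json, os

    REGISTRY = "/etc/failed_disks.json"   # assumed location

    def load_failed():
        if os.path.exists(REGISTRY):
            with open(REGISTRY) as f:
                return set(json.load(f))
        return set()

    def mark_failed(serial):
        failed = load_failed()
        failed.add(serial)
        with open(REGISTRY, "w") as f:
            json.dump(sorted(failed), f)

    def eligible_as_spare(serial):
        # A rebooting filer would call this before adding an unclaimed
        # disk to the hot spare pool.
        return serial not in load_failed()

Keying on the serial number rather than the shelf position means the mark follows the physical drive even if it gets moved to another slot.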