Had something similar happen to us. One of our Netapps failed a drive and reconstructed its data onto a different drive (from the hot spare pool). The failed drive should have been replaced ASAP, but it fell through the cracks and was forgotten.
A week later we shut it down to replace an FDDI card, and afterwards it wouldn't boot up. The failed drive was apparently working just well enough that the Netapp saw a RAID drive that wasn't a valid member of the array (inconsistent disk labels). Once we removed the problem drive, the Netapp booted just fine.
I don't understand why your bad drive was added to the hot spare pool upon reboot. It should have had a label that was inconsistent with the other drives, and the Netapp shouldn't have booted.
Could the Netapp somehow mark a bad drive so that the information is kept across boots?
If a failed drive is still working after a reboot, then its label should be inconsistent with the other drives in the array, and the Netapp shouldn't boot.
NOTE: For those wondering what label I am talking about, here is an excerpt from the System Administrator's Guide chapter on Troubleshooting:
The Netapp writes a label on each disk indicating its position in the RAID disk array.
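To make that concrete, here is a toy sketch of what such a label and the boot-time consistency check might look like. The fields, names, and numbers are invented for illustration only; this is not the actual on-disk format.

    /*
     * Conceptual sketch only -- not the real on-disk label format.
     * The fields are invented to illustrate a per-disk RAID label and
     * a boot-time consistency check across the whole array.
     */
    #include <stdio.h>

    struct raid_label {
        unsigned int array_id;    /* which RAID array the disk belongs to */
        unsigned int position;    /* slot within that array               */
        unsigned int generation;  /* bumped whenever the array changes    */
    };

    /* Return 1 if all labels name the same array and generation and each
     * disk claims a unique, valid position; 0 otherwise. */
    static int labels_consistent(const struct raid_label *lbl, int ndisks)
    {
        int seen[64] = { 0 };

        if (ndisks < 1 || ndisks > 64)
            return 0;
        for (int i = 0; i < ndisks; i++) {
            if (lbl[i].array_id   != lbl[0].array_id ||
                lbl[i].generation != lbl[0].generation)
                return 0;             /* disk from another/older array  */
            if (lbl[i].position >= (unsigned int)ndisks || seen[lbl[i].position]++)
                return 0;             /* missing or duplicated position */
        }
        return 1;
    }

    int main(void)
    {
        /* Disk 1 carries a stale label (older generation), much like a
         * failed-but-still-spinning drive that missed a reconstruction. */
        struct raid_label lbl[3] = { { 7, 0, 42 }, { 7, 1, 41 }, { 7, 2, 42 } };

        printf("labels consistent: %d\n", labels_consistent(lbl, 3));
        return 0;
    }

A drive whose label disagrees with the rest of the array is exactly what the filer complained about in our case.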
regards,
Steve Gremban  <gremban@ti.com>
PS: I noticed that you are running without a hot spare configured. We normally configure a spare in order to minimize the time spent at risk in degraded mode.
Do you consider the risk minimal that another drive will fail before the first one is replaced and rebuilt, or do you have some plan in place to ensure that someone is notified immediately? What about early in the morning, on weekends, or on holidays?
Anyone else out there running without hot spares?
------------- Begin Forwarded Message -------------
From madhatta@turing.mathworks.com Mon Jul 7 15:40:34 1997
Date: Mon, 7 Jul 1997 16:34:26 -0400 (EDT)
From: Brian Tao <taob@nbc.netcom.ca>
To: toasters@mathworks.com
Subject: Marking failed drives across boots?
MIME-Version: 1.0
We had a problem with our Netapp this morning that could potentially be quite serious. One drive near the beginning of the chain (ID 2, I believe) was failed out by the Netapp. Very shortly thereafter, the filer crashed with a RAID panic and rebooted. Upon rebooting, it noticed that drive ID 2 was not actively being used, and proceeded to add it to the hot spare pool. Then it began reconstructing the data onto (you guessed it) drive ID 2.
In this scenario, there was no time to pull out the bad drive, and the Netapp happily rebuilt the data on it. I guess the correct procedure now is to forcibly fail that drive, rebuild onto our good spare drive, and then remove drive ID 2. Could the Netapp somehow mark a bad drive so that the information is kept across boots?
> A week later we shut it down to replace an FDDI card, and afterwards it
> wouldn't boot up. The failed drive was apparently working just well enough
> that the Netapp saw a RAID drive that wasn't a valid member of the array
> (inconsistent disk labels). Once we removed the problem drive, the Netapp
> booted just fine.
I'd have to say the above is not "normal" behavior. But from what I've seen, what makes the difference is how a drive fails.
Normally, if the drive fails and reconstruction proceeds normally, the drive will look bad to the system and be unusable. This gives you time to swap it out for a new one.
If the system reboots and the drive has failed badly enough, it will fail initialization on boot (the console will say disk so-and-so is broken), so it won't appear to the system as a spare. The system will boot fine (other than the fact that it's probably in degraded mode, since it lost a disk). I think this is pretty much how it's "supposed" to work.
Often, though, if the drive suffered only a minor failure, it will look fine upon reboot and will get marked as a spare. This is known bad behavior that should be fixed.
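In rough code terms, the boot-time decision above looks something like the sketch below. The structure and names are made up for illustration; this is not Data ONTAP source.

    /*
     * Sketch of the boot-time classification described above.  Not real
     * filer code; the struct and names are invented for illustration.
     */
    #include <stdio.h>

    enum disk_state { DISK_BROKEN, DISK_ARRAY_MEMBER, DISK_SPARE };

    struct disk {
        int responds;       /* passes initialization at boot                */
        int label_valid;    /* carries a readable RAID label                */
        int label_matches;  /* that label agrees with the rest of the array */
    };

    static enum disk_state classify_at_boot(const struct disk *d)
    {
        if (!d->responds)
            return DISK_BROKEN;        /* "disk so-and-so is broken"      */
        if (d->label_valid && d->label_matches)
            return DISK_ARRAY_MEMBER;  /* rejoins the array               */

        /* A drive that failed only "softly" still responds, so it falls
         * through to here and gets treated as a spare -- the problem case. */
        return DISK_SPARE;
    }

    int main(void)
    {
        /* Alive, but its label no longer marks it as an active member. */
        struct disk softly_failed = { 1, 1, 0 };

        printf("state: %d (2 == spare)\n", classify_at_boot(&softly_failed));
        return 0;
    }

The trouble is that nothing on the drive itself records that it was ever failed, so a softly-failed disk ends up in that last branch.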
If the system fails in an unusual way, you can get the "inconsistent label" problem or something similar. I've usually only seen this when the system crashes immediately after it tries to fail a drive, while it was in the process of switching over to reconstruction or degraded mode. My understanding as a customer is that incidents like this are also bugs that should be fixed.
Also, there are times when a SCSI problem can make a drive look bad; the system attempts to fail the drive but winds up rebooting shortly thereafter due to bus problems, and the drive comes back fine. Although it "failed", the system never got far enough to actually fail the drive, and WAFL replay succeeds, so no data is lost.
Clearly, one issue here is making sure that when a drive fails, it's actually marked as BAD in some way. However, if the drive has failed in a sufficiently spectacular way, writing a "BAD" label onto it may be impossible. Despite the online description of bug 961, this issue is included within that bug, along with proactively failing a drive that looks like it "needs" to be replaced, given the frequency of errors.
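One way to picture that persistent marking is the sketch below. Everything here is hypothetical -- invented names and a stubbed-out label write, not the filer's real label format or API.

    /*
     * Sketch of the "mark it BAD persistently" idea.  All names and the
     * label layout are invented for illustration.
     */
    #include <stdio.h>

    #define LABEL_FLAG_FAILED  0x1u   /* "this disk was failed; do not reuse" */

    struct raid_label {
        unsigned int flags;
        /* ... position, array id, generation, etc. ... */
    };

    /* Stand-in for writing the label block back to the drive; a real
     * implementation would issue the write and report its status. */
    static int write_label(int disk_id, const struct raid_label *lbl)
    {
        printf("writing label (flags=0x%x) to disk %d\n", lbl->flags, disk_id);
        return 0;   /* pretend the write succeeded */
    }

    /* Record the failure on the disk itself so it survives a reboot. */
    static int mark_disk_failed(int disk_id, struct raid_label *lbl)
    {
        lbl->flags |= LABEL_FLAG_FAILED;
        return write_label(disk_id, lbl);
    }

    int main(void)
    {
        struct raid_label lbl = { 0 };
        return mark_disk_failed(2, &lbl) == 0 ? 0 : 1;
    }

If the drive is too far gone to accept even that write, the failure would have to be remembered somewhere else (the surviving disks' labels, NVRAM, and so on), which is presumably part of what makes this hard.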
Bruce