Re: Marking failed drives across boots?

8 Jul 1997


      ...
A week later we shut it down to replace an fddi card and afterwards
it wouldn't boot up. The failed drive was apparently working well
enough so that the Netapp thought it had a RAID drive that wasn't a
valid member of the array (inconsistent disk labels). Once we removed
the problem drive the Netapp booted just fine.

I'd have to say the above is not "normal" behavior.  But from what I've
seen, what makes the difference is how a drive fails.

Normally, if a drive fails and reconstruction completes as expected,
the drive will look bad to the system and will be unusable.  This
gives you time to swap out the drive with a new one.

If the system reboots and the drive has failed badly enough, it will
fail initialization on boot (it will say disk so-and-so is broken), so
it won't appear to the system as a spare.  The system will boot fine
(other than the fact that it's probably in degraded mode since it lost
a disk).  I think this is pretty much how it's "supposed" to work.

Often, if the drive only had a minor failure, the disk will look fine
upon reboot, and will get marked as a spare.  This is known bad behavior
that should be fixed.

If the system fails in an unusual way, you can get the "inconsistent
label" or a similar problem.  I've usually only seen this when the
system crashes immediately after it tries to fail a drive, while it is
still switching over to reconstruction or degraded mode.  My
understanding as a customer is that incidents like this are also bugs
that should be fixed.

Also, there are times when a SCSI problem can cause a drive to look
bad, and the system attempts to fail the drive, but winds up rebooting
shortly thereafter due to bus problems, and the drive comes back fine.
Although the drive looked "failed", the system never got far enough to
actually fail it, and WAFL replay succeeds, so no data is lost.

Clearly an issue here is to make sure that when a drive fails, it's
actually marked as BAD in some way.  However, if the drive has failed
in a most spectacular way, writing a "BAD" label onto the drive may be
impossible.  Despite the online description of bug 961, this issue
is covered by that bug, along with proactively failing a drive that
looks like it "needs" to be replaced given the frequency of errors.
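
To make the cases above a bit more concrete, here is a rough sketch in
Python of how I think about what a drive looks like at boot.  This is
only my mental model as a customer, not how the filer actually
implements it; every name in it (Drive, passes_init, label_says_member,
array_expects_it, classify_at_boot) is made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class BootState(Enum):
    BROKEN = "broken"                    # fails initialization on boot
    SPARE = "spare"                      # looks healthy, not in the array
    INCONSISTENT = "inconsistent label"  # claims membership the array denies
    MEMBER = "array member"              # still a valid member


@dataclass
class Drive:
    # All three fields are hypothetical simplifications.
    passes_init: bool        # does the drive respond well enough at boot?
    label_says_member: bool  # does its own label claim RAID membership?
    array_expects_it: bool   # do the other disks still list it as a member?


def classify_at_boot(d: Drive) -> BootState:
    # Failed badly enough: reported as broken, never offered as a spare.
    if not d.passes_init:
        return BootState.BROKEN
    # Crash in the middle of failing the drive: the labels disagree.
    if d.label_says_member and not d.array_expects_it:
        return BootState.INCONSISTENT
    # The failure never completed (e.g. a bus glitch): still a member.
    if d.label_says_member and d.array_expects_it:
        return BootState.MEMBER
    # Minor failure, drive looks fine again: picked up as a spare,
    # which is the behavior I'd call a bug.
    return BootState.SPARE
```

The last branch is the troublesome one: nothing on the drive records
that it was ever failed, so a healthy-looking disk is taken at face
value.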
Bruce
-- 
Bruce Sterling Woodcock                    Network & Systems Administration
Network / System Administrator             Information Technology
sterling@netapp.com                        Network Appliance, Inc.
