Leila wrote:
Is it true that when one HD goes bad, it will be reconstructed on the hot spare, but if you don't pull out the bad HD and replace it within 24 hours, the system will shut down?
Not quite; you're confusing two separate things.
When a drive fails, the system goes into "degraded" mode. (A second drive failure in the same RAID group at this point will result in loss of data.)
When the system is in degraded mode, it will immediately start reconstructing the failed drive's data onto a hot spare (*if* one is available!), using the surviving drives and parity. When that process is complete, the system is no longer in degraded mode: the hot spare has been made a full member of the RAID group, so the group is no longer down a disk. (Of course, the whole system is now down one hot spare.) The rebuild can be a lengthy process, depending on the size of the RAID group and how busy the machine is.
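To make the sequence concrete, here's a toy sketch in Python of the bookkeeping described above. It's illustrative only, not NetApp code; every name in it (handle_failure, rebuild, and so on) is invented for this example:

    # Toy model of the failure -> rebuild -> normal sequence described
    # above. Illustrative only; all names here are invented.

    def rebuild(raid_group, spare):
        """Reconstruct the dead drive's contents onto the spare from the
        surviving drives and parity. In real life this runs in the
        background and can take hours."""
        raid_group["members"].append(spare)

    def handle_failure(raid_group, spare_pool):
        raid_group["degraded"] = True       # second failure now = data loss
        if spare_pool:                      # *if* a hot spare is available...
            spare = spare_pool.pop()        # ...the spare pool shrinks by one
            rebuild(raid_group, spare)
            raid_group["degraded"] = False  # spare is now a full group member
        # else: stay degraded; the shutdown clock (see below) keeps ticking

    rg = {"members": ["disk0", "disk1", "disk2"], "degraded": False}
    spares = ["spare0"]
    handle_failure(rg, spares)
    print(rg["degraded"], spares)   # False [] -- healthy again, but no spares left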
What you're thinking of is that by default, if the system has been running in degraded mode for 24 hours (i.e. if a drive failed 24 hours ago and hasn't been replaced either by a hot spare or a new disk), the system will shut down. I think the idea is to reduce the likelihood of data loss caused by a second disk failure (and make sure the admin knows that something's wrong). At least under 5.1, you can adjust the time limit, so for instance you could make it long enough to cover a long weekend, or, if a system were under 24x7 supervision, you could shorten it.
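If memory serves, the knob in question is the raid.timeout option (a value in hours), settable from the console; treat the exact option name as an assumption for older releases:

    toaster> options raid.timeout 72    # e.g. long enough to cover a long weekend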
Unless your RAID groups are very large and your machine is very heavily loaded (i.e. unless the rebuild itself takes more than 24 hours), the only case where you're likely to stay in degraded mode for 24 hours (and risk the machine shutting down) is if you *don't* have a hot spare, so the rebuild can't happen.
AFAIK, there's no problem leaving a bad drive sitting in the shelf arbitrarily long, as long as you don't need the drive bay for something else (unless the drive is bad in a way that might affect other drives, like if flames or sparks are coming out of it :-).
What happens in the case where you have 2 hot spares?
In the normal case, it doesn't make any difference to this scenario, except that you end up (before you physically replace drives) with one hot spare left instead of zero. That means if another drive fails *after the RAID group has been rebuilt*, there's still a hot spare to replace it. Likewise, if a drive fails in a different RAID group, you should be OK.
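A standalone sketch of that spare-pool bookkeeping, in the same illustrative spirit as the earlier one (names invented):

    # Two hot spares: each failure consumes one; only a third failure
    # leaves the system degraded. Illustrative only.
    spares = ["spare0", "spare1"]

    def consume_spare():
        return spares.pop() if spares else None   # None -> stay degraded

    print(consume_spare())  # 1st failure: rebuilt onto a spare, one left
    print(consume_spare())  # 2nd failure (after rebuild, or other group): covered
    print(consume_spare())  # 3rd failure: None -> degraded, shutdown clock runs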
-j.
What you're thinking of is that by default, if the system has been running in degraded mode for 24 hours (i.e. if a drive failed 24 hours ago and hasn't been replaced either by a hot spare or a new disk), the system will shut down. I think the idea is to reduce the likelihood of data loss caused by a second disk failure (and make sure the admin knows that something's wrong).
Our motivation was really your second guess:
To make sure the admin knows something is wrong.
We couldn't come up with any way to absolutely, reliably *guarantee* that the sysadmin would get notified except to have the box turn itself off. We even joked about putting lights and sirens on the box, but the problem is, at smaller sites, people sometimes stick these things in a closet somewhere and forget about them.
Of course, at many sites, especially larger sites, people have all sorts of notification methods (like syslog, or autosupport linked into an e-mail pager), so we do have an option to turn off the 24-hour shutdown feature.
Dave