Leila wrote:
> Is it true that when one HD goes bad, it will be reconstructed on the
> hot spare, but if you don't pull out the bad HD and replace it within
> 24 hours the system will shut down?
Not quite; you're confusing two separate things.
When a drive fails, the system goes into `degraded' mode. (A second
drive failure in the same RAID group at this point will result in
loss of data, since a single parity disk can only reconstruct one
missing drive.)
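(If this is a NetApp filer, which the mention of 5.1 suggests, you can
check the state of your RAID groups from the console; on the releases
I've used, `sysconfig -r' prints the RAID configuration, including any
failed disks and available spares:

    sysconfig -r

The exact output varies from release to release.)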
When the system is in degraded mode, it will immediately start
reconstructing the failed drive's data onto a hot spare (*if* one is
available!), using parity and the surviving disks. When that process
is complete, the system is no longer in degraded mode, because the hot
spare has been made a part of the RAID group, so the RAID group is no
longer down a disk. (Of course, the whole system is now down one hot
spare.) This can be a somewhat lengthy process, depending on the size
of the RAID group.
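(For a rough sense of scale, with numbers invented purely for
illustration: reconstruction has to read every surviving disk in the
group and write the rebuilt data to the spare, so if you have 9 GB
drives and the filer can sustain, say, 3 MB/s of rebuild throughput
while still serving clients, that's 9 GB / 3 MB/s, or about 50
minutes; halve the throughput on a busy box and it's closer to two
hours.)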
What you're thinking of is that by default, if the system has been
running in degraded mode for 24 hours (i.e. if a drive failed 24 hours
ago and hasn't been replaced either by a hot spare or a new disk),
the system will shut down. I think the idea is to reduce the likelihood
of data loss caused by a second disk failure (and to make sure the
admin knows that something's wrong). At least under 5.1, you can
adjust the time limit, so for instance you could make it long enough
to cover a long weekend, or if a system were under 24x7 supervision,
you could shorten it.
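(For example, assuming the timeout is governed by the `raid.timeout'
option, which takes a value in hours on the Data ONTAP releases I'm
familiar with,

    options raid.timeout 72

at the console would give you three days of degraded running instead
of one. Check your release's documentation for the exact option name
and units.)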
Unless your RAID groups are very large and your machine is very
heavily loaded, the only case where you're likely to stay in degraded
mode for 24 hours (and risk the machine shutting down) is if you
*don't* have a hot spare, so the rebuild can't happen.
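(Confirming that you actually have a spare is cheap; on the releases
I've used, `vol status -s' lists the spare disks, and `sysconfig -r'
shows them as well:

    vol status -s

It's worth a look after any disk swap.)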
AFAIK, there's no problem leaving a bad drive sitting in the shelf
arbitrarily long, as long as you don't need the drive bay for something
else (unless the drive is bad in a way that might affect other drives,
like if flames or sparks are coming out of it :-).
> What happens in the case where you have 2 hot spares?
In the normal case, it doesn't make any difference to this scenario,
except that you end up (before you physically replace drives) with one
hot spare instead of zero. That means if another drive fails *after
the RAID group has been rebuilt*, there's still a hot spare to replace
it. Likewise, if a drive fails in a different RAID group at the same
time, the second spare covers it and you should be OK.
-j.
--
Jay Sekora
<jay@ccs.neu.edu>
Northeastern University
College of Computer Science