Leila wrote:
Is it true that when one HD goes bad, it will be reconstructed on the hot spare, but if you don't pull out the bad HD and replace it within 24 hours, the system will shut down?
Not quite; you're confusing two separate things.
When a drive fails, the system goes into "degraded" mode. (A second drive failure in the same RAID group at this point will result in loss of data.)
When the system is in degraded mode, it will immediately start reconstructing the failed drive's data onto a hot spare (*if* one is available!), using the surviving drives and parity. When that process is complete, the system is no longer in degraded mode: the hot spare has been made a full member of the RAID group, so the group is no longer down a disk. (Of course, the whole system is now down one hot spare.) The rebuild can be a lengthy process, depending on the size of the RAID group and how busy the machine is.
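To make the sequence concrete, here's a toy sketch in Python of the bookkeeping described above. It's illustrative only, not NetApp code; every name in it (handle_failure, rebuild, and so on) is invented for this example:

    # Toy model of the failure -> rebuild -> normal sequence described
    # above. Illustrative only; all names here are invented.

    def rebuild(raid_group, spare):
        """Reconstruct the dead drive's contents onto the spare from the
        surviving drives and parity. In real life this runs in the
        background and can take hours."""
        raid_group["members"].append(spare)

    def handle_failure(raid_group, spare_pool):
        raid_group["degraded"] = True       # second failure now = data loss
        if spare_pool:                      # *if* a hot spare is available...
            spare = spare_pool.pop()        # ...the spare pool shrinks by one
            rebuild(raid_group, spare)
            raid_group["degraded"] = False  # spare is now a full group member
        # else: stay degraded; the shutdown clock (see below) keeps ticking

    rg = {"members": ["disk0", "disk1", "disk2"], "degraded": False}
    spares = ["spare0"]
    handle_failure(rg, spares)
    print(rg["degraded"], spares)   # False [] -- healthy again, but no spares left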
What you're thinking of is that by default, if the system has been running in degraded mode for 24 hours (i.e. if a drive failed 24 hours ago and hasn't been replaced either by a hot spare or a new disk), the system will shut down. I think the idea is to reduce the likelihood of data loss caused by a second disk failure (and make sure the admin knows that something's wrong). At least under 5.1, you can adjust the time limit, so for instance you could make it long enough to cover a long weekend, or, if a system were under 24x7 supervision, you could shorten it.
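If memory serves, the knob in question is the raid.timeout option (a value in hours), settable from the console; treat the exact option name as an assumption for older releases:

    toaster> options raid.timeout 72    # e.g. long enough to cover a long weekend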
Unless your RAID groups are very large and your machine is very heavily loaded (i.e. unless the rebuild itself takes more than 24 hours), the only case where you're likely to stay in degraded mode for 24 hours (and risk the machine shutting down) is if you *don't* have a hot spare, so the rebuild can't happen.
AFAIK, there's no problem leaving a bad drive sitting in the shelf arbitrarily long, as long as you don't need the drive bay for something else (unless the drive is bad in a way that might affect other drives, like if flames or sparks are coming out of it :-).
What happens in the case where you have 2 hot spares?
In the normal case, it doesn't make any difference to this scenario, except that you end up (before you physically replace drives) with one hot spare left instead of zero. That means if another drive fails *after the RAID group has been rebuilt*, there's still a hot spare to replace it. Likewise, if a drive fails in a different RAID group, you should be OK.
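A standalone sketch of that spare-pool bookkeeping, in the same illustrative spirit as the earlier one (names invented):

    # Two hot spares: each failure consumes one; only a third failure
    # leaves the system degraded. Illustrative only.
    spares = ["spare0", "spare1"]

    def consume_spare():
        return spares.pop() if spares else None   # None -> stay degraded

    print(consume_spare())  # 1st failure: rebuilt onto a spare, one left
    print(consume_spare())  # 2nd failure (after rebuild, or other group): covered
    print(consume_spare())  # 3rd failure: None -> degraded, shutdown clock runs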
-j.
What you're thinking of is that by default, if the system has been running in degraded mode for 24 hours (i.e. if a drive failed 24 hours ago and hasn't been replaced either by a hot spare or a new disk), the system will shut down. I think the idea is to reduce the likelihood of data loss caused by a second disk failure (and make sure the admin knows that something's wrong).
Our motivation was really your second guess:
To make sure the admin knows something is wrong.
We couldn't come up with any way to absolutely, reliably *guarantee* that the sysadmin would get notified except to have the box turn itself off. We even joked about putting lights and sirens on the box, but the problem is, at smaller sites, people sometimes stick these things in a closet somewhere and forget about them.
Of course, at many sites, especially larger sites, people have all sorts of notification methods (like syslog, or autosupport linked into an e-mail pager), so we do have an option to turn off the 24-hour shutdown feature.
Dave