Hello,
I am deploying a new 8040, and it was requested that the aggregates / raid
groups be laid out in such a way that no more than 2 disks in any raid
group are within the same shelf.
At first this sounds like it reduces single points of failure and could
protect availability from the failure of a full disk shelf.
I argue against this strategy and was wondering if anyone on this list has
any feedback.
My thought is that this configuration marginally increases availability
at the cost of additional risk to data integrity. With this strategy,
each time a disk failed we would endure not only the initial rebuild to a
spare, but also a second rebuild when a disk replace is executed to put the
original shelf/slot/disk back into the active raid group.
Additionally, if a shelf failure were encountered, I question whether it
would even be possible to limp along. In an example configuration, we would
be down 24 disks, and 4 or 5 would rebuild to the remaining available spares.
Those rebuilds alone would require significant CPU to run concurrently,
and I expect they would impact data services significantly. Additionally, at
least 10 other raid groups would be either single or double degraded. I
expect the performance degradation at this point would be so great that the
most practical course of action would be to shut down the system until the
failed shelf could be replaced.
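
To make the arithmetic behind that estimate explicit, here is a rough
sketch in Python. It is only a back-of-the-envelope model, not anything
ONTAP reports: the function name and parameters are made up for
illustration, and it assumes the failed shelf holds 24 data/parity disks
(no spares), that every affected raid group has exactly 2 disks in that
shelf, and that the spares happen to land in the best possible way
(repairing whole raid groups first).

```python
def shelf_failure_impact(disks_per_shelf=24, max_per_rg_per_shelf=2, spares=5):
    """Rough best-case model of losing one full shelf (hypothetical helper).

    Assumes every affected raid group has exactly `max_per_rg_per_shelf`
    disks in the failed shelf and that no spares live in that shelf.
    """
    if disks_per_shelf % max_per_rg_per_shelf != 0:
        raise ValueError("model assumes the shelf divides evenly across raid groups")

    # Raid groups that lose disks when the shelf goes away.
    groups_hit = disks_per_shelf // max_per_rg_per_shelf

    # Only as many disks can rebuild as there are spares left.
    rebuilt = min(spares, disks_per_shelf)
    still_missing = disks_per_shelf - rebuilt

    # Best case: spares are used to fully repair whole raid groups first.
    fully_repaired = rebuilt // max_per_rg_per_shelf
    still_degraded = groups_hit - fully_repaired

    return {
        "groups_hit": groups_hit,
        "rebuilt": rebuilt,
        "still_missing": still_missing,
        "still_degraded": still_degraded,
    }


if __name__ == "__main__":
    for n in (4, 5):
        print(n, "spares:", shelf_failure_impact(spares=n))
```

Under those assumptions, 12 raid groups are hit, and with either 4 or 5
spares at least 10 of them remain single or double degraded once the
spares are used up, which is where my number above comes from.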
Thanks for any input. I would like to know if anyone has any experience
thinking through this type of scenario. Is considering this configuration
interesting or perhaps silly? Are any best practice recommendations being
violated?
Thanks in advance.
--Jordan