I experienced a disk shelf failure once: some internal electronics failed, smoke and all that fun ... no data availability.

Another possible option is to enable data mirroring across two separate SAS domains; it used to need the syncmirror_local license on 7-Mode ONTAP.
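
To sketch the idea: SyncMirror keeps two plexes per aggregate, each plex built entirely from one pool (one SAS domain), so losing a whole shelf can take out at most one plex. A toy Python model of that invariant, assuming 12 shelves of 24 disks split evenly across the two domains (the names here are invented for illustration, not an ONTAP API):

    # Toy model of SyncMirror-style plexing. Assumptions: 12 shelves of 24
    # disks, shelves 0-5 on SAS domain/pool 0 and shelves 6-11 on pool 1.
    # Each plex draws only from one pool, so a whole-shelf failure can
    # degrade at most one plex and the mirrored aggregate stays online.
    from collections import namedtuple

    Disk = namedtuple("Disk", ["shelf", "slot", "pool"])

    disks = [Disk(shelf, slot, 0 if shelf < 6 else 1)
             for shelf in range(12) for slot in range(24)]

    plex0 = [d for d in disks if d.pool == 0]
    plex1 = [d for d in disks if d.pool == 1]

    def survives_shelf_loss(failed_shelf):
        """Aggregate stays available if at least one plex is untouched."""
        plex0_ok = all(d.shelf != failed_shelf for d in plex0)
        plex1_ok = all(d.shelf != failed_shelf for d in plex1)
        return plex0_ok or plex1_ok

    # Any single shelf can fail and one complete plex still remains.
    print(all(survives_shelf_loss(s) for s in range(12)))   # True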

On Jun 21, 2016 18:15, "jordan slingerland" <jordan.slingerland@gmail.com> wrote:
Thanks for all the replies so far.  That is a valid point, but I believe in that situation each raid group could be extended by 1 or 2 disks.  The initial configuration will be 12 shelves, @tmac, and 12-disk raid groups.  So though the raid groups will end up smaller than I would typically recommend, they are not tiny.
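
A quick back-of-the-envelope check of that claim (assuming 24-disk shelves and ignoring spares and parity, so my numbers, not figures from the thread):

    # Back-of-the-envelope: 12 shelves x 24 disks, 12-disk raid groups,
    # at most 2 raid-group disks per shelf; spares and parity ignored.
    SHELVES, DISKS_PER_SHELF, RG_SIZE = 12, 24, 12

    total_disks = SHELVES * DISKS_PER_SHELF        # 288
    raid_groups = total_disks // RG_SIZE           # 24 groups of 12

    # A new 24-disk shelf spread across 24 groups is 1 disk per group,
    # and the <=2-disks-per-shelf rule would even allow 2 per group.
    growth_per_rg = DISKS_PER_SHELF / raid_groups
    print(raid_groups, growth_per_rg)              # 24 1.0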

On Tue, Jun 21, 2016 at 12:01 PM, Rhorer, Kyle L. (JSC-OD)[THE BOEING COMPANY] <kyle.l.rhorer@nasa.gov> wrote:
Another issue to think about besides resiliency… what happens in this “no more than two RAID group disks per shelf” scheme when they want to add another shelf because they’re running out of capacity?

> On Jun 21, 2016, at 10:11, jordan slingerland <jordan.slingerland@gmail.com> wrote:
>
> Hello,
>
> I am deploying a new 8040, and it was requested that the aggregates / raid groups be laid out such that no more than 2 disks in any raid group are within the same shelf.
>
> At first this sounds like it reduces single points of failure and could protect availability from the failure of a full disk shelf.
>
> I argue against this strategy and was wondering if anyone on this list had any feedback.
>
> My thought is that this configuration marginally increases availability at the cost of additional risk to data integrity.  With this strategy, each time a disk failed we would endure not only the initial rebuild to a spare, but a second rebuild when a disk replace is executed to put the original shelf/slot/disk back into the active raid group.
>
> Additionally, if a shelf failure were encountered, I question whether it would even be possible to limp along. In an example configuration, we would be down 24 disks; 4 or 5 would rebuild to the remaining available spares.  Those rebuilds alone would require significant CPU to occur concurrently, and I expect they would impact data services significantly.  On top of that, at least 10 other raid groups would be either singly or doubly degraded.  I expect the performance degradation at this point would be so great that the most practical course of action would be to shut down the system until the failed shelf could be replaced.
>
>
> Thanks for any input.  I would like to know if anyone has experience thinking through this type of scenario.  Is this configuration interesting to consider, or perhaps silly?  Are any best-practice recommendations being violated?
>
> Thanks in advance.
>
> --Jordan
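
To put rough numbers on the shelf-failure scenario in the quoted message: assuming 24 RAID-DP groups of 12 disks, each group taking exactly 2 disks from each of 6 shelves, and 4 hot spares (all assumptions on my part, not figures from the thread), a quick Python sketch:

    # Rough model of the quoted shelf-failure scenario. Assumptions (mine):
    # 12 shelves x 24 disks, 24 RAID-DP groups of 12 disks, where groups
    # 0-11 take 2 disks from each of shelves 0-5 and groups 12-23 take
    # 2 disks from each of shelves 6-11, with 4 hot spares available.
    RG_COUNT, SPARES = 24, 4
    rg_shelves = {g: range(0, 6) if g < 12 else range(6, 12)
                  for g in range(RG_COUNT)}

    failed_shelf = 3                       # any one shelf dies (24 disks)
    hit = [g for g in range(RG_COUNT) if failed_shelf in rg_shelves[g]]

    # Every affected group loses 2 disks: double-degraded under RAID-DP.
    double_degraded = len(hit)             # 12 groups
    # Best case, the 4 spares rebuild disks in 4 distinct groups, leaving
    # those single-degraded; the rest keep zero remaining redundancy.
    still_double = double_degraded - SPARES          # 8
    print(double_degraded, still_double, SPARES)     # 12 8 4

In that layout, eight groups would be running with no remaining parity protection, which is what makes the "shut it down until the shelf is replaced" option look like the practical one.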



_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters