All the drives on the shelf disappeared from the system when it "wedged", so
they were marked failed (or just missing). When I initially made the
volumes, I organized the raid groups such that there's only one drive from
any given raid group on a shelf. So with 12 disks per shelf and 7 shelves in
the system (for example, on the R100), I set my raidgroup size to 7 and made
sure that whenever the volumes were created or expanded, the disks were
added in groups of 7 by name, one from each shelf. I organized it that way
exactly for this failure case - losing a whole shelf.
When the drives disappeared there weren't enough spares to rebuild (until we
power-cycled the shelf), so most of the raid groups were running one disk
short, but they didn't go offline since each was only missing one disk.
Because the volumes didn't go offline, the disks couldn't be reassimilated
into the system after the reboot - they were now out of date - so they all
became spares. So basically the system has to rebuild one drive in each of
the 12 raid groups on the system, two at a time. The NetApp folks chalked it
up to a bug and I updated the systems to 6.5.2. Luckily the systems are
internally used so the downtime wasn't a huge deal.
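If it helps picture the layout, here's a rough sketch in Python (not ONTAP
syntax - the shelf/slot numbering is just illustrative) of how the disks map
to raid groups and why losing an entire shelf only costs each group one disk:

# Sketch of the layout described above: 7 shelves of 12 disks, raidgroup
# size 7. Raid group N takes slot N from every shelf, so a whole-shelf
# failure removes exactly one disk from each raid group.

SHELVES = 7           # shelves in the system (as on the R100 example)
DISKS_PER_SHELF = 12  # disks per shelf

# raid_groups[g] is the list of (shelf, slot) pairs in raid group g.
raid_groups = {
    slot: [(shelf, slot) for shelf in range(SHELVES)]
    for slot in range(DISKS_PER_SHELF)
}

def disks_lost(failed_shelf):
    """Count how many disks each raid group loses if an entire shelf drops out."""
    return {
        group: sum(1 for shelf, _ in disks if shelf == failed_shelf)
        for group, disks in raid_groups.items()
    }

losses = disks_lost(failed_shelf=3)
# Every raid group loses exactly one disk - survivable with single-parity RAID.
assert all(n == 1 for n in losses.values())
print(len(raid_groups), "raid groups, each missing", max(losses.values()), "disk")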
Not something I want recurring on a regular basis, and with data on the
volumes it was a bit of a heart-stopper when it started happening. Lucky
for me, I suppose, there was no data lost.
Simon.
-----Original Message-----
From: Michael Christian [mailto:mchristi@yahoo-inc.com]
Subject: RE: New Simplified Monitoring Tool
How did a shelf hang result in 10 failed drives? And how did you avoid
a double disk failure with that many failures?