Had an interesting issue over the weekend at a client, with
a happy ending thanks to NetApp support.
The client had two aggregates on a 980 controller, one with
144GB drives and one with 300GB drives. Each aggregate ran RAID-DP and had
one hot spare.
During the weekend RAID scrub, one of the 144GB drives
started throwing errors and went into predictive failure mode, so a disk copy
started moving its data to the 144GB hot spare. Halfway through, the predictively
bad drive failed hard, and that is where it got interesting.
Rather than rebuild the rest of the 144GB hot spare from
parity (it already held much of the data), ONTAP aborted the
disk copy and picked ANOTHER spare drive to rebuild to. Of course the only one
left was a 300GB drive, so it picked that one, and instead of right-sizing it
down to 144GB, it left it at its original size (since you can do that, as
Adam Fox reminded us in an earlier post today). So now we had a 144GB
aggregate with a single 300GB disk in it, and no 300GB spares
left. Not good.
So first-level support zeroed the 144GB drive
that had the aborted disk copy on it and tried to fail the 300GB drive with
disk fail -f, in order to bring balance back to this world. It didn’t
work. Turns out that if there are no spares, you can’t do a disk fail on the
command line, not even with -f!
The next suggestion was to move all the data on the
144GB aggregate to another aggregate, destroy the original, and rebuild it. But
of course we didn’t have enough free space anywhere to do that, so I was going
to have to borrow a 300GB shelf from my friendly NetApp office to do the trick,
and I could see my weekend slowly slipping away.
Luckily it didn’t come to that; we escalated the case with
support. The final answer involved literally yanking the drive, plus some
other steps with the disk fail -i command. That started a rebuild
back onto the original 144GB hot spare, which took care of the issue.
The moral of the story: if you end up accidentally
recovering to a spare that is bigger than the other disks in the aggregate, don’t
despair! NetApp support can help. Also, in mission-critical environments it’s
probably a good idea to have at least TWO hot spares of each
size.
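
For anyone who wants to check their own spare counts, here is a rough
Python sketch of that rule of thumb. It is not a NetApp tool and does not
talk to a filer; the function name, the threshold, and the disk counts are
all made up for illustration, and you would have to feed it sizes pulled
from your own inventory.

```python
from collections import Counter

# Rule of thumb from the post: at least two hot spares per disk size.
MIN_SPARES = 2


def sizes_short_on_spares(data_disk_sizes_gb, spare_sizes_gb, min_spares=MIN_SPARES):
    """Return {size_gb: spares_present} for each disk size in use
    that has fewer than min_spares hot spares of the same size."""
    spares = Counter(spare_sizes_gb)  # missing sizes count as 0
    return {
        size: spares[size]
        for size in sorted(set(data_disk_sizes_gb))
        if spares[size] < min_spares
    }


if __name__ == "__main__":
    # Hypothetical inventory resembling the setup above:
    # two disk sizes in use, one hot spare of each.
    data_disks = [144] * 13 + [300] * 13  # disk counts are invented
    hot_spares = [144, 300]
    print(sizes_short_on_spares(data_disks, hot_spares))
```

With one spare per size, both sizes get flagged. Give each size a second
spare and the result comes back empty.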
Glenn (the other one)