Had an interesting issue over the weekend at a client, with a happy ending thanks to NetApp support.

The client had two aggregates on a 980 controller, one with 144GB drives and one with 300GB drives. Each aggregate was running RAID-DP and had one hot spare.

During the weekend RAID scrub, one of the 144GB drives started popping errors and went into predictive failure mode, so a disk copy started copying its data to the 144GB hot spare. Halfway through this, the predictively bad drive actually failed hard, and that is where it got interesting.

Rather than rebuild the rest of the 144GB hot spare from parity (since it already had much of the data), ONTAP aborted the disk copy and picked ANOTHER spare drive to rebuild to. Of course the only one left was a 300GB drive, so it picked that one, and instead of right-sizing it down to 144GB, it left it at its original size (since you can do that, as Adam Fox reminded us in an earlier post today). So now we had a 144GB aggregate with a single 300GB disk in it, and no 300GB spares left! Not good.

So first-level support zeroed out the 144GB drive that had the aborted disk copy on it, and attempted to fail the 300GB drive with disk fail -f, in order to bring balance back to this world. It didn’t work; turns out that if there are no spares, you can’t do a disk fail on the command line, not even with -f!

So the next answer was to move all the data on the 144GB aggregate to another aggregate, wipe out the original aggregate and rebuild it. But of course we didn’t have enough space to do that anywhere, so I was going to have to borrow a 300GB shelf from my friendly NetApp office to do the trick, and I could see my weekend slowly slipping away.

Luckily it didn’t come to that; we escalated it up to support. The final answer involved literally yanking the drive, and some other stuff with the disk fail -i command. This started a rebuild back to the original 144GB hot spare, and that took care of the issue.

The moral of the story: if you end up accidentally recovering to a bigger spare than the rest of the disks in the aggregate, don’t despair! NetApp support can help. Also, it’s probably a good idea, in mission-critical environments, to have at least TWO hot spares of each size.
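
For anyone who wants to check the two-spares-per-size rule against their own inventory, here is a rough Python sketch. Everything in it is an assumption made for illustration: the list-of-dicts inventory format, the `role` and `size_gb` field names, the `sizes_short_on_spares` helper and the disk counts in the example are all made up, and it does not talk to a filer or parse any real ONTAP output.

```python
from collections import Counter


def sizes_short_on_spares(disks, min_spares=2):
    """Return {size_gb: spare_count} for each in-use disk size with too few spares.

    `disks` is a hypothetical inventory: a list of dicts with a "size_gb"
    key and a "role" key ("data", "parity" or "spare").
    """
    # Disk sizes that are actually in use inside aggregates.
    sizes_in_use = {d["size_gb"] for d in disks if d["role"] != "spare"}
    # Number of hot spares available at each size.
    spare_counts = Counter(d["size_gb"] for d in disks if d["role"] == "spare")
    return {
        size: spare_counts.get(size, 0)
        for size in sorted(sizes_in_use)
        if spare_counts.get(size, 0) < min_spares
    }


# Made-up inventory shaped like the setup in this story:
# two disk sizes, one hot spare of each.
inventory = (
    [{"size_gb": 144, "role": "data"}] * 3
    + [{"size_gb": 144, "role": "spare"}]
    + [{"size_gb": 300, "role": "data"}] * 3
    + [{"size_gb": 300, "role": "spare"}]
)

shortfall = sizes_short_on_spares(inventory)
print(shortfall)
```

With one spare per size, both sizes get flagged; add a second spare of each size and the result comes back empty.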

Glenn (the other one)