Thought I'd share - toasters

15 Nov 2007


Had an interesting issue over the weekend at a client, with a happy
ending thanks to NetApp support.
The client had two aggregates on a 980 controller, one with 144GB
drives, one with 300GB drives. Each aggr had one hot spare, running
RAID-DP.
During the weekend RAID scrub, one of the 144GB drives started popping
errors and went into predictive failure mode, so the diskcopy started
copying data to the hot 144GB spare.  Halfway through this, the
predictively bad drive actually failed hard. Now it gets interesting.
Rather than rebuild the rest of the 144GB hot spare from parity (since
it already had much of the data), ONTAP decided to abort the diskcopy
and pick ANOTHER spare drive to rebuild to. Of course the only one left
was a 300GB drive, so it picked that one, and instead of right-sizing
it down to 144GB, it let it stay at the original size (since you can do
that, as Adam Fox reminded us in an earlier post today).  So now we had
a 144GB aggregate with a single 300GB disk in it, and also no 300GB
spares left!  Not good.
So first-level support zeroed out the 144GB drive that had the aborted
diskcopy on it, and attempted to fail the 300GB drive with a disk fail
-f, in order to bring balance back to this world.  That didn't work; it
turns out that if there are no spares, you can't do a disk fail on the
command line, not even with -f!
So the next answer was to move all the data on the 144GB aggr to
another aggr, wipe out the original aggr and rebuild.  But of course we
didn't have enough space to do that anywhere, so I was going to have to
borrow a 300GB shelf from my friendly NetApp office to do the trick, and
I could see my weekend slowly slipping away.
Luckily it didn't come to that; we escalated it up the support chain.
The final answer involved literally yanking the drive, and some other
stuff with the disk fail -i command.  This started a rebuild back to
the original 144GB hot spare, and that took care of the issue.
The moral of the story: if you end up accidentally recovering to a
bigger spare than the rest of your disks in the aggregate, don't
despair!  NetApp support can help.  Also, it's probably a good idea, in
mission-critical environments, to have at least TWO hot spares of each
size.
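For anyone who wants to sanity-check their own spare counts against
that rule of thumb, here is a rough Python sketch. It does not talk to
a filer or parse any real ONTAP output; the function name, the list
inputs and the disk counts are all made up for illustration, and you
would have to fill in the sizes by hand from your own inventory.

```python
from collections import Counter

# Rule of thumb from the post: at least two hot spares per disk size.
MIN_SPARES_PER_SIZE = 2


def spare_shortfalls(in_use_sizes_gb, spare_sizes_gb, minimum=MIN_SPARES_PER_SIZE):
    """Return {size_gb: missing_count} for every in-use disk size
    that has fewer than `minimum` spares of exactly that size.

    Both arguments are plain lists of disk sizes in GB (hypothetical
    inputs, entered by hand, not read from a filer).
    """
    spares = Counter(spare_sizes_gb)
    shortfalls = {}
    for size in sorted(set(in_use_sizes_gb)):
        missing = minimum - spares[size]  # Counter returns 0 for unknown sizes
        if missing > 0:
            shortfalls[size] = missing
    return shortfalls


if __name__ == "__main__":
    # Roughly the setup described above: two disk sizes, one spare each.
    # The disk counts per aggregate are invented for the example.
    in_use = [144] * 13 + [300] * 13
    spares = [144, 300]
    for size, missing in spare_shortfalls(in_use, spares).items():
        print(f"{size}GB: add {missing} more spare(s)")
```

With one spare of each size, as in the story, both sizes get flagged as
one spare short.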
Glenn (the other one)