Here is a tale of woe, which you will find entertaining not only
because it offers lotsa useful information about filer management,
but also because it makes me look like an idiot.
Several weeks ago, some co-workers of mine and I undertook to add
a fibre-channel controller and shelf to each of two F630's, each
of which previously had only one dual-SCSI adapter.
We did the wrong thing, as it turned out (much later). The FC cards
are PCI cards with a long edge connector. F630's have two kinds of
PCI slots: some long, some short. We naively jumped to the conclusion
that long cards go in a long slot, so we installed the FC cards in slot
10, rather than slot 7, which had been our plan before opening the
filers.
Slot 10, it turns out, was a bad idea. Disk controller cards on F630's
are only supported in slots 5-8, inclusive. And, believe it or not,
you are supposed to put the card in there in spite of the fact that it
is too big...you just leave the extra section of connector hanging in
mid-air.
So we had just put the filers in an illegal configuration. You may be
interested to know the failure mode: None. The filers booted right
up, found all the disks, and announced them; we added them to a volume,
and we were off to the races.
A week later, the regular disk scrubs found bizillions of parity
inconsistencies on each filer and fixed them.
Then the filers started crashing ("PANIC: Freeing free block"). They
would only restart after a run of wackz. Then the next disk scrub
would put us on the path to ruin again.
After several interactions with several levels of NetApp technical
support, eventually all parties came to understand what had happened,
and the fact that the FC cards are not supported in slot 10. So we got
the FC cards put into the right slots, did a wackz, and breathed
easier.
Then one of the filers crashed again.
The culprit there: wackz and disk scrubs look at two completely
different pieces of the puzzle. I had the (once again) mistaken idea
that wackz fixed a superset of the problems that a disk scrub fixes.
Wrong. Wackz only looks at the file system structure and ignores the
parity data; to fix the parity data, you need a disk scrub.
Moral of the story: to make sure your disks have happy data on them,
do both a wackz and a disk scrub. Yer not done until you do both.
With that problem fixed, we breathed a sigh of relief.
Then one of the filers crashed again.
It turned out that that filer had been in the middle of doing its
nightly backup when it crashed the first time, and so it had created
a snapshot. Call it "fred". The system came back up, and eventually
got fixed, wacked, and scrubbed, but fred was still corrupt.
My backup script contained this:
    rsh toaster snap create fred
It did not check whether fred already existed; if it did, the snap
create would just fail, and then the backup would proceed to dump the
(existing) fred. Which, if fred was corrupt, would panic the system.
I deleted fred. Then the filer was able to back itself up fine.
So: make sure your snapshots are fresh. My backup script now starts
like this:
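    # Refuse to run if a snapshot named fred is already there.
    # -w matches fred as a whole word, so "alfredo" does not count.
    if rsh toaster snap list | grep -w fred > /dev/null 2>&1
    then
        echo "I refuse to work under these conditions" >&2
        exit 5
    fi
    rsh toaster snap create fred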
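One wrinkle with grepping for a name: the pattern can also hit other snapshots whose names merely contain it. Here is a rough sketch of a stricter check. The helper name snapshot_listed is made up for illustration, and it assumes that the snapshot name is the last whitespace-separated field on each line of the snap list output; check that against what your filer actually prints before trusting it.

```shell
#!/bin/sh
# Sketch only. snapshot_listed is a hypothetical helper, not a filer command.
# It reads a snapshot listing on stdin and succeeds only if some line's
# last whitespace-separated field is exactly the name given as $1.
# Assumption: the snapshot name is the last field on its line.
snapshot_listed() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}

# Intended use (commented out so the sketch runs without a filer);
# "toaster" and "fred" are the names from the script above:
#
#   if rsh toaster snap list | snapshot_listed fred
#   then
#       echo "I refuse to work under these conditions" >&2
#       exit 5
#   fi
#   rsh toaster snap create fred
```

The test is done on the local side of the pipe because rsh does not hand back the remote command's exit status, which is the same reason the script greps the output in the first place.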
Brian
P.S. If you know about PCI, you may know that long PCI cards are
64-bit, and short ones are 32-bit. I've omitted that from the
discussion above, since what leaps out at you first when you go
to install a PCI card is not its bit-width but its physical
width...the latter was what got us started down this awful path
in the first place.
P.P.S. NetApp Engineering is looking into building checks for illegal
configurations into future versions of ONTAP.