Some fun things not to try at home

21 Dec 1999


      Here is a tale of woe, which you will find entertaining not only
because it offers lotsa useful information about filer management,
but also because it makes me look like an idiot.

Several weeks ago, some co-workers of mine and I undertook to add
a fibre-channel controller and shelf to each of two F630s, each
of which previously had only one dual-SCSI adapter.

We did the wrong thing, as it turned out (much later).  The FC cards
are PCI cards with a long edge connector.  F630s have two kinds of
PCI slots: some long, some short.  We naively jumped to the conclusion
that long cards go in a long slot, so we installed the FC cards in slot
10, rather than slot 7, which had been our plan before opening the
filers.

Slot 10, it turns out, was a bad idea.  Disk controller cards on F630s
are only supported in slots 5-8, inclusive.  And, believe it or not,
you are supposed to put the card in there in spite of the fact that it
is too big...you just leave the extra section of connector hanging in
mid-air.

So we had just put the filers in an illegal configuration.  You may be
interested to know the failure mode:  None.  The filers booted right
up, found all the disks, and announced them; we added them to a volume,
and we were off to the races.

A week later, the regular disk scrubs found bizillions of parity
inconsistencies on each filer and fixed them.

Then the filers started crashing ("PANIC: Freeing free block").  They
would only restart after a run of wackz.  Then the next disk scrub
would put us on the path to ruin again.

After several interactions with several levels of NetApp technical
support, eventually all parties came to understand what had happened,
and the fact that the FC cards are not supported in slot 10.  So we got
the FC cards put into the right slots, did a wackz, and breathed
easier.

Then one of the filers crashed again.

The culprit there: wackz and disk scrubs look at two completely
different pieces of the puzzle.  I had the (once again) mistaken idea
that wackz fixed a superset of the problems that a disk scrub fixes.
Wrong.  Wackz only looks at the file system structure and ignores the
parity data; to fix the parity data, you need a disk scrub.

Moral of the story: to make sure your disks have happy data on them,
do both a wackz and a disk scrub.  Yer not done until you do both.

With that problem fixed, we breathed a sigh of relief.

Then one of the filers crashed again.

It turned out that that filer had been in the middle of doing its
nightly backup when it crashed the first time, and so it had created
a snapshot.  Call it "fred".  The system came back up, and eventually
got fixed, wacked, and scrubbed, but fred was still corrupt.

My backup script contained this:

    rsh toaster snap create fred

It did not check to see whether fred already existed; if it did, the snap
create would just fail, and then the backup would proceed to dump the
(existing) fred.  Which, if fred is corrupt, would panic the system.

I deleted fred.  Then the filer was able to back itself up fine.

So: make sure your snapshots are fresh.  My backup script now starts
like this:

    rsh toaster snap list | grep fred > /dev/null 2>&1
    if [ $? -eq 0 ]
    then
        echo I refuse to work under these conditions
        exit 5
    fi
    rsh toaster snap create fred
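
One weakness of the grep in that check: it matches any line that merely
contains "fred", so a snapshot named "alfred" would also stop the backup.
Below is a minimal sketch of an exact-name test.  It is not part of the
original script.  The helper name snapshot_listed is hypothetical, and the
sketch assumes that the snapshot name is the last whitespace-separated
field on each line of the snap list output; check that against what your
filer actually prints before relying on it.

```shell
#!/bin/sh
# Sketch only: exact-name snapshot check for a backup script.
# Assumption: the snapshot name is the last whitespace-separated field
# on each line of `snap list` output.

# snapshot_listed NAME
# Reads `snap list` output on stdin.  Exits 0 if some line's last field
# is exactly NAME, and 1 otherwise (including on empty input).
snapshot_listed() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}

# Intended use in the backup script (left commented out here):
#   if rsh toaster snap list | snapshot_listed fred
#   then
#       echo I refuse to work under these conditions
#       exit 5
#   fi
#   rsh toaster snap create fred
```

The check still parses the listing rather than testing the result of the
snap command, since classic rsh does not hand back the exit status of the
remote command.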
Brian

P.S.   If you know about PCI, you may know that long PCI cards are
       64-bit, and short ones are 32-bit.  I've omitted that from the
       discussion above, since what leaps out at you first when you go
       to install a PCI card is not its bit-width but its physical
       width...the latter was what got us started down this awful path
       in the first place.

P.P.S. NetApp Engineering is looking into building checks for illegal
       configurations into future versions of ONTAP.
