Here is a tale of woe, which you will find entertaining not only because it offers lotsa useful information about filer management, but also because it makes me look like an idiot.
Several weeks ago, some co-workers and I undertook to add a fibre-channel controller and shelf to each of two F630's, each of which previously had only one dual-SCSI adapter.
We did the wrong thing, as it turned out (much later). The FC cards are PCI cards with a long edge connector. F630's have two kinds of PCI slots: some long, some short. We naively jumped to the conclusion that long cards go in a long slot, so we installed the FC cards in slot 10, rather than slot 7, which had been our plan before opening the filers.
Slot 10, it turns out, was a bad idea. Disk controller cards on F630's are only supported in slots 5-8, inclusive. And, believe it or not, you are supposed to put the card in there in spite of the fact that it is too big...you just leave the extra section of connector hanging in mid-air.
So we had just put the filers in an illegal configuration. You may be interested to know the failure mode: None. The filers booted right up, found all the disks, and announced them; we added them to a volume, and we were off to the races.
A week later, the regular disk scrubs found bizillions of parity inconsistencies on each filer and fixed them.
Then the filers started crashing ("PANIC: Freeing free block"). They would only restart after a run of wackz. Then the next disk scrub would put us on the path to ruin again.
After several interactions with several levels of NetApp technical support, all parties eventually came to understand what had happened: the FC cards are not supported in slot 10. So we got the FC cards put into the right slots, did a wackz, and breathed easier.
Then one of the filers crashed again.
The culprit there: wackz and disk scrubs look at two completely different pieces of the puzzle. I had the (once again) mistaken idea that wackz fixed a superset of the problems that a disk scrub fixes. Wrong. Wackz only looks at the file system structure and ignores the parity data; to fix the parity data, you need a disk scrub.
Moral of the story: to make sure your disks have happy data on them, do both a wackz and a disk scrub. Yer not done until you do both.
With that problem fixed, we breathed a sigh of relief.
Then one of the filers crashed again.
It turned out that that filer had been in the middle of doing its nightly backup when it crashed the first time, and so it had created a snapshot. Call it "fred". The system came back up, and eventually got fixed, wacked, and scrubbed, but fred was still corrupt.
My backup script contained this:
rsh toaster snap create fred
It did not check to see whether fred already existed; if it did, the snap create would just fail, and then the backup would proceed to dump the (existing) fred. Which, if fred was corrupt, would panic the system.
I deleted fred. Then the filer was able to back itself up fine.
So: make sure your snapshots are fresh. My backup script now starts like this:
rsh toaster snap list | grep fred > /dev/null 2>&1
if [ $? -eq 0 ]
then
    echo I refuse to work under these conditions
    exit 5
fi
rsh toaster snap create fred
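One possible refinement, offered as a sketch rather than as what the script actually does: a plain "grep fred" also matches any snapshot whose name merely contains "fred" (say, "alfred" or "fred.old"), so the script could refuse to run when it has no reason to. The helper below matches the name exactly. It assumes that the snapshot name is the last whitespace-separated field on each line of the snap list output; check that against what your filer prints. The function name snapshot_exists is made up for illustration.

```shell
#!/bin/sh
# Sketch only. Reads `snap list` output on stdin and exits 0 if a snapshot
# named exactly $1 is listed, 1 otherwise.
# Assumption: the snapshot name is the last field on its line.
snapshot_exists() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}

# Hypothetical usage, mirroring the script above:
#   if rsh toaster snap list | snapshot_exists fred
#   then
#       echo I refuse to work under these conditions
#       exit 5
#   fi
#   rsh toaster snap create fred
```

Reading from stdin keeps the check separate from rsh, so it can be tried against a saved copy of the snap list output before it goes anywhere near a backup.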
Brian
P.S. If you know about PCI, you may know that long PCI cards are 64-bit, and short ones are 32-bit. I've omitted that from the discussion above, since what leaps out at you first when you go to install a PCI card is not its bit-width but its physical width...the latter was what got us started down this awful path in the first place.
P.P.S. NetApp Engineering is looking into building checks for illegal configurations into future versions of ONTAP.