Here is a tale of woe, which you will find entertaining not only because it offers lotsa useful information about filer management, but also because it makes me look like an idiot.
Several weeks ago, some co-workers of mine and I undertook to add a fibre-channel controller and shelf to each of two F630's, each of which previously had only one dual-SCSI adapter.
We did the wrong thing, as it turned out (much later). The FC cards are PCI cards with a long edge connector. F630's have two kinds of PCI slots: some long, some short. We naively jumped to the conclusion that long cards go in a long slot, so we installed the FC cards in slot 10, rather than slot 7, which had been our plan before opening the filers.
Slot 10, it turns out, was a bad idea. Disk controller cards on F630's are only supported in slots 5-8, inclusive. And, believe it or not, you are supposed to put the card in there in spite of the fact that it is too big...you just leave the extra section of connector hanging in mid-air.
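For what it's worth, the slot rule above is easy to write down as a check. This is only a sketch: the function name f630_disk_slot_ok is made up for illustration, and the 5-8 range is simply the F630 disk-controller rule as stated in this post, not something read from the filer.

```shell
#!/bin/sh
# Hypothetical helper. The supported range (slots 5-8) is taken from the
# post above and applies only to disk controller cards on an F630.
f630_disk_slot_ok() {
    case "$1" in
        5|6|7|8) return 0 ;;
        *)       return 1 ;;
    esac
}

# Slot 7 was the original plan; slot 10 is where the cards actually went.
for slot in 7 10; do
    if f630_disk_slot_ok "$slot"; then
        echo "slot $slot: supported for a disk controller"
    else
        echo "slot $slot: NOT supported for a disk controller"
    fi
done
```

Other filer models and other card types have their own rules, so the range would have to be looked up per platform.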
So we had just put the filers in an illegal configuration. You may be interested to know the failure mode: None. The filers booted right up, found all the disks, and announced them; we added them to a volume, and we were off to the races.
A week later, the regular disk scrubs found bizillions of parity inconsistencies on each filer and fixed them.
Then the filers started crashing ("PANIC: Freeing free block"). They would only restart after a run of wackz. Then the next disk scrub would put us on the path to ruin again.
After several interactions with several levels of NetApp technical support, eventually all parties came to understand what had happened, and the fact that the FC cards are not supported in slot 10. So we got the FC cards put into the right slots, did a wackz, and breathed easier.
Then one of the filers crashed again.
The culprit there: wackz and disk scrubs look at two completely different pieces of the puzzle. I had the (once again) mistaken idea that wackz fixed a superset of the problems that a disk scrub fixes. Wrong. Wackz only looks at the file system structure and ignores the parity data; to fix the parity data, you need a disk scrub.
Moral of the story: to make sure your disks have happy data on them, do both a wackz and a disk scrub. Yer not done until you do both.
With that problem fixed, we breathed a sigh of relief.
Then one of the filers crashed again.
It turned out that that filer had been in the middle of doing its nightly backup when it crashed the first time, and so it had created a snapshot. Call it "fred". The system came back up, and eventually got fixed, wacked, and scrubbed, but fred was still corrupt.
My backup script contained this:
rsh toaster snap create fred
It did not check to see whether fred exists already; if so, the snap create would just fail, and then the backup would proceed to dump the (existing) fred. Which, if fred is corrupt, would panic the system.
I deleted fred. Then the filer was able to back itself up fine.
So: make sure your snapshots are fresh. My backup script now starts like this:
rsh toaster snap list | grep fred > /dev/null 2>&1
if [ $? -eq 0 ]
then
    echo I refuse to work under these conditions
    exit 5
fi
rsh toaster snap create fred
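A slightly stricter variant of that check, as a sketch only: a bare grep for fred would also match a snapshot called, say, alfred, so this version compares the whole name. It assumes the snapshot name is the last whitespace-separated field on its line of "snap list" output; the function name snapshot_exists and the sample listing are invented for illustration.

```shell
#!/bin/sh
# snapshot_exists NAME: reads "snap list" output on stdin and succeeds only
# if the last field of some line is exactly NAME.
# Assumption: the snapshot name is the last field on its line.
snapshot_exists() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}

# Made-up sample listing, standing in for: rsh toaster snap list
sample='  0% ( 0%)    0% ( 0%)  Dec 21 00:00  nightly.0
  1% ( 1%)    0% ( 0%)  Dec 20 23:05  alfred'

if printf '%s\n' "$sample" | snapshot_exists fred; then
    echo "fred already exists; refusing to back up a stale snapshot"
else
    echo "no snapshot named fred; safe to create it"
fi
```

In the real script the listing would come from "rsh toaster snap list | snapshot_exists fred". The check looks at the listing rather than at the result of the snap create because rsh does not pass back the exit status of the remote command.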
Brian
P.S. If you know about PCI, you may know that long PCI cards are 64-bit, and short ones are 32-bit. I've omitted that from the discussion above, since what leaps out at you first when you go to install a PCI card is not its bit-width but its physical width...the latter was what got us started down this awful path in the first place.
P.P.S. NetApp Engineering is looking into building checks for illegal configurations into future versions of ONTAP.
"Brian" == Brian Rice brice@gnac.com writes:
Brian> We did the wrong thing, as it turned out (much later). The
Brian> FC cards are PCI cards with a long edge connector. F630's
Brian> have two kinds of
I hate reading documentation. I always jump right into the parts bag, who cares if a few parts are left over when construction is done.
But, I must say, when I had to add FC-AL cards to my F740's, I made sure I looked up the slot assignments...
Brian> P.P.S. NetApp Engineering is looking into building checks
Brian> for illegal configurations into future versions of ONTAP.
Um, perhaps an easier and just as effective solution might be to place a sticker somewhere prominent on the chassis either specifying the slot assignments or referring the customer to documentation.
The Sun E450's include such.
j.
--
Jay Soffian jay@cimedia.com
UNIX Systems Engineer 404.572.1941
Cox Interactive Media
Um, perhaps an easier and just as effective solution might be to place a sticker somewhere prominent on the chassis either specifying the slot assignments or referring the customer to documentation.
Appropriate slots can change depending on software version, so stickers are not appropriate.
I was under the impression there were some checks in the software, but I guess they didn't detect the particular case discussed here. Such checks can also be a problem if NetApp wants to make an exception for a particular customer and then has to roll a special release without the check for them every time.
The biggest problem is the user didn't read the documentation. I have no sympathy for them. They get what they deserve.
Bruce
On Tue, 21 Dec 1999, Jay Soffian wrote:
Um, perhaps an easier and just as effective solution might be to place a sticker somewhere prominent on the chassis either specifying the slot assignments or referring the customer to documentation.
An even better solution would be to write the code so that the controllers are supported in 64-bit mode.
Tom
On Tue, 21 Dec 1999, Brian Rice wrote:
P.P.S. NetApp Engineering is looking into building checks for illegal configurations into future versions of ONTAP.
What version are you running? 5.3.4 (and possibly earlier 5.3's) have a "sysconfig -c" which does the check for you. I think it also complains on bootup if it finds unsupported slot configurations. The data is stored in /etc/sysconfigtab on the filer. I noticed that there are already entries for the F840 and C840 filers in there. ;-)
Cluster caveat: If you install a ServerNet cluster interface and a second FC-AL adapter connected to the partner's disks *without* enabling clustering, "sysconfig -c" will report that those two cards are in an unsupported configuration. What it really means is that clustering hardware is not supported until clustering is turned on (well, okay....). NetApp tech support even reproduced the "bug" in their lab, until they realized what was going on. ;-)
Brian Rice wrote:
Here is a tale of woe, which you will find entertaining not only because it offers lotsa useful information about filer management, but also because it makes me look like an idiot.
Thanks for the post. You are right, it contains some very good info. Hopefully the "you get what you deserve" crowd won't discourage people from making similar posts in the future.
Graham
Thanks for the post. You are right, it contains some very good info. Hopefully the "you get what you deserve" crowd won't discourage people from making similar posts in the future.
Speaking of people like me... :)
I wanted to make it clear that I don't want to discourage anyone from talking. I didn't even jump on the original poster; I was just answering one of the follow-ups. People make mistakes and I've made mistakes with the best of them. If you find out something useful in your filer experience, please pass it along, even if it's of the "I didn't read the manual, so I did this and it crashed" variety.
ObBadFilerExperience - Replacing an old Fas 450 with an F540 and physically installing a variety of old "bare" disk drives into the DEC StorageWorks containers to use in the new shelves. "This will work, right?" Then spending the next 6 hours physically rebooting, checking to see which shelf had trouble recognizing disks, removing them one by one, rebooting again, seeing it get further along in the boot sequence or not, etc. etc. until all the problem drives were found. Some of the older drives didn't like one version of the StorageWorks container, but did work in another, and we spent the rest of the night scrounging for parts off other containers for the right disk container. :)
Bruce
A fellow toasters subscriber wrote:
I wanted to make it clear that I don't want to discourage anyone from talking. I didn't even jump on the original poster; I was just answering one of the follow-ups. People make mistakes and I've made mistakes with the best of them. If you find out something useful in your filer experience, please pass it along, even if it's of the "I didn't read the manual, so I did this and it crashed" variety.
I appreciate the clarification, pardner; I was in fact feeling a little jumped on.
Anyway, I should have mentioned that I searched both NOW and the F630 documentation in vain for documentation about configuration rules for FC controllers, and found nothing. This information isn't even in the F630 field service guide.
Well into the diagnosis process, a NetApp tech support person gave me the following URL, which points to a spiff-o-matic Web page that would have been a lifesaver, had I been able to find it earlier:
http://now.netapp.com/NOW/knowledge/docs/hardware/syscfg/index.htm
But here's what would have been really smart for me to do before I began: pick up the phone, call NetApp technical support, and ask which slots accept an FC card on an F630. Of all the dumb things I did, this sin of omission was certainly the dumbest.
In response to another writer's question: I'm running ONTAP 5.3.2D3, so sysconfig -c is not there for me yet. Nice to know it's out there, though.
Finally, let me chime in in agreement on what my compatriot says above. One of the most valuable side effects of the toasters mailing list is that it provides a searchable database of what happens when you do X. I hope that folks who have done X, even a dumb X, will post their horror stories, so as to get the information out there and on disk. I'm sure that many toasters subscribers, including me, would be happy to forward along stories to the list in an anonymous fashion (anonymous to the actual culprit, that is; stripping out the identifying information). Hey, if the FAA can do this for airline pilots, we can do it for filer admins.
Brian brice@gnac.com
On Wed, 22 Dec 1999, Bruce Sterling Woodcock wrote:
Some of the older drives didn't like one version of the StorageWorks container, but did work in another, and we spent the rest of the night scrounging for parts off other containers for the right disk container. :)
Is that right? I just plugged a brand spanking new spare (well, it's been on the shelf for a while) into a 330's shelf and the disk was not detected even after several tries (a defective spare?). I popped a different disk in and it was detected immediately. I wonder whether that was the problem, whether the shelf is becoming flaky, or whether it was simply a bad spare.
Tom
I've had this happen before... a long time ago anyway.
Check the disk. In my case I tried to put a WIDE SCSI disk into a narrow shelf. Result, not even recognized.
Pulled it out, looked at the label, and in the wise words of Homer Simpson said "Dogh!!" when I read "4GB WIDE SCSI".
--tmac
-- ******All New Numbers!!!****** ************* *************
Timothy A. McCarthy --> System Engineer, Eastern Region Network Appliance http://www.netapp.com 240-268-2034 Office \ / Page Me at: 240-268-2002 Fax / 888-971-4468
----- Original Message -----
From: tkaczma@gryf.net
To: toasters@mathworks.com
Sent: Wednesday, December 22, 1999 12:55 PM
Subject: Re: Some fun things not to try at home
On Wed, 22 Dec 1999, Bruce Sterling Woodcock wrote:
Some of the older drives didn't like one version of the StorageWorks container, but did work in another, and we spent the rest of the night scrounging for parts off other containers for the right disk container. :)
Is that right? I just plugged a brand spanking new spare (well, it's been on the shelf for a while) into a 330's shelf and the disk was not detected even after several tries (a defective spare?). I popped a different disk in and it was detected immediately. I wonder whether that was the problem, whether the shelf is becoming flaky, or whether it was simply a bad spare.
This was some time back. Specifically, DEC DSP3210S drives (the 2 GB narrow drives) only worked in DEC StorageWorks containers which had serial numbers that started with 41, 42, or 43 when placed in the narrow slots on a wide shelf. Newer containers probably work as well; I seem to recall the ones that didn't were numbered starting in the 30s. I don't think NetApp ever sold such drives in those containers, so this is unlikely to be your problem unless you had an old drive you stuffed into an old container, and even then, I think the problem only cropped up when you used it in a wide shelf. I could be wrong about that part, though.
If you have looked inside the old DEC carriers, there was a little chip on the internal ribbon that connected the internal drive to the external back connector. The chip configuration was visibly different between the ones with the older serial numbers (the serial number ON THE CARRIER, mind you) and the newer ones. (I'm presuming the higher numbers were more recent models, which was our reasoned assumption at the time.)
Bruce