I heard David N. Blank-Edelman (Thu, May 04, 2000 at 10:47:30PM -0400) say
> Dave Hitz hitz@netapp.com writes:
> > I also have a request. Please turn on autosupport. That's extremely important to us for two reasons:
>
> Quick question: once upon a time, machine-gone-kablooey autosupport mail was only sent when a filer booted. It was the Yoda scenario ("there is no retry, only do!"). If something went awry in that sending, the message was just dropped on the floor, no queuing, no retrying.
>
> We had a situation where our filer bounced twice, but on the first return to service it didn't have good net (or NIS, or name resolution, can't recall which). This meant that the first "here's the problem" mail never got to the mothership, but the "just booted, everything is peachy" mail did. Does the crash autosupport facility do anything more sophisticated these days? (he asks, because his filer hasn't crashed in many moons...).

Yeah, that's better now. We used to get that a lot - part of the problem can be the interaction of Cisco switches and Netapps - the Cisco can take a short while to figure out that the link is back up to a Netapp and the mail falls on the floor while waiting for that.
Some time ago, presumably at the insistence of ourselves and others, Netapp inserted a 30-second wait and retry loop into bootup, and now I think it gives autosupport 4 or 5 tries over a couple of minutes. We always get autosupport off them now. Not that, I admit, we've had a crash message in a while, but we never used to get them on reboots either :<
As for dealing with network or DNS or whatever not coming back, one thing you can do is specify your mailhost wholly by IP - while fractionally more work to maintain if the boxes on your network change frequently, it requires a little less to go right.
If you are worried about missing a crash, use SNMP monitoring and grab system.sysUpTime.0, which is the uptime in hundredths of a second. I run a large number of Netapps of different breeds, and I just run a Perl script which grabs the uptime off every one every 10 minutes. Any crashes are immediately apparent. It could also write to a file and compare current uptime to previous uptime - if uptime drops, it can SMS you, mail you, sound sirens, flash lights, whatever :>
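
For anyone who wants the compare-against-previous idea spelled out, here is a rough sketch in Python rather than Perl. It is not the script I run: the function names and the state file are made up for illustration, and it assumes you already have the sysUpTime values (in hundredths of a second) from whatever SNMP tool you use.

```python
import json
from pathlib import Path


def find_reboots(previous, current):
    """Return the hosts whose uptime went down between two polls.

    Both arguments map hostname -> uptime in hundredths of a second.
    Hosts seen for the first time are not reported.
    """
    return sorted(
        host
        for host, ticks in current.items()
        if host in previous and ticks < previous[host]
    )


def check_and_store(state_file, current):
    """Compare this poll with the saved one, then save the new values.

    state_file is a hypothetical JSON file holding the last uptime
    seen per host. Hosts missing from this poll keep their old value,
    so a reboot is still noticed once they answer again.
    """
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    rebooted = find_reboots(previous, current)
    path.write_text(json.dumps({**previous, **current}))
    return rebooted
```

Call check_and_store() from cron every 10 minutes with the freshly polled values and hang your mail or SMS alert off a non-empty result. One caveat: sysUpTime is a 32-bit counter, so it wraps after roughly 497 days and a wrap will look like a reboot to this check.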
Now I just wish they'd get it right with the caches. They still only try mailing autosupport once :p
J
---------------------------------------------------------------------------
# John Denholm                                      johnd@theplanet.net #
# Webcache & Filer Administrator, Planet Online        +44 113 207 6357 #
---------------------------------------------------------------------------
Error 404: There is no spoon