tcp stack reset, why; and automating a fix? - toasters

20 Dec 1999


      Folks,
Hey again, Jay.
I've got a question about a tcp stack failure that happened over the weekend on one of the
clustered 740's.
Gary Andrade sez:
...
The box would respond to ping (inbound), but any outbound attempts
failed.  The outbound TCP stack was toast.  Skip and I conferred on
the situation and the box was rebooted (problem resolved).
The fail over to Wesson did not occur due to the inbound ping working,
the other server sensed Smith as working.
I'm wondering:  is it possible to configure clustering (smith and wesson are a pair of 740's) to
detect tcp failure?  There are no cluster-level error messages on smith.
messages.0
...
Sat Dec 18 23:52:41 GMT [smith: rc]: DNS server for domain "saegis.com" not responding : Connection timed out.
Sun Dec 19 00:00:00 GMT [smith: statd]:  12:00am up 29 days, 21:18 138103986 NFS ops, 0 CIFS ops, 0 HTTP ops
And this is the reboot:
[appworx@:/smith/spool/etc]$ head messages
Sun Dec 19 00:26:45 GMT [smith: rc]: System shut down with "reboot" command.
Sun Dec 19 00:26:45 GMT [smith: cf_main]: Cluster monitor: takeover of partner disabled (local halt in progress)
The web servers did show nfs failures while smith was having its problem:
...
Dec 19 00:01:43 ww5 automountd[235]: server smith not responding
Dec 19 00:01:43 ww5 last message repeated 6 times
Dec 19 00:08:14 ww5 automountd[235]: server smith not responding
Dec 19 00:08:15 ww5 last message repeated 6 times
This server was rebooted as well, for the same reason -- stack failure.
Can you suggest some possible reasons for the filer's tcp stack failure?  We did have an
incident last week where both web servers needed to be rebooted for the same reason -- tcp resets
from stack overflow; the tentative theory at the moment is a syn flood attack.  But the netapp
isn't exposed to the internet, so that theory doesn't explain the filer.
If it's not possible to configure the filer to fail over on a failed outbound ping, then I'll
script something.  My second question:  what do I need to sacrifice to the gods to get complete
command-line rsh access?
smith> ping ww5
ww5.saegis.com is alive
smith> Connection closed by foreign host.
[appworx@:/apps/appworxl]$ rsh smith 'sysconfig -r' | grep root
Volume spool (root)
[appworx@:/apps/appworx]$ rsh smith ping ww5
ping not found.  Type '?' for a list of commands
[appworx@:/apps/appworx]$
I could set up a test such that if nfs fails but the filer can still be pinged, then
[reboot/fail over].  Feels like a kludge, though... not specific enough.  I'd prefer to have the
filers handle this on their own.
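For what it's worth, the ping-but-no-nfs test could be sketched roughly as below.  This is a
minimal Python sketch under stated assumptions, not something tried against the filers:  it
assumes nfs is reachable over tcp on port 2049 (udp-only mounts would need a different probe),
that the monitoring host's ping accepts "-c 1", and that three failed checks in a row is a
sensible threshold.  The names ping_ok, tcp_ok, decide and watch, and the on_failure hook, are
made up for illustration; what the hook actually does (reboot, takeover, page someone) is left
open.

```python
import socket
import subprocess
import time


def ping_ok(host, timeout_s=5):
    """Return True if a single ICMP echo to host succeeds, using the system ping."""
    try:
        # "-c 1" (send one echo) is the Linux/BSD spelling; other systems differ.
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def tcp_ok(host, port=2049, timeout_s=5):
    """Return True if a TCP connection to host:port can be opened (assumed NFS port)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def decide(ping_alive, service_alive, strikes, threshold=3):
    """Pure decision step: return (action, new_strike_count).

    "act"  -> ping answers but the service has failed `threshold` checks in a row
    "wait" -> same symptom, but not yet enough consecutive failures
    "none" -> healthy, or fully down (presumably left to the cluster's own monitoring)
    """
    if ping_alive and not service_alive:
        strikes += 1
        if strikes >= threshold:
            return "act", 0
        return "wait", strikes
    return "none", 0


def watch(host, on_failure, interval_s=60, max_checks=None):
    """Poll host and call on_failure(host) whenever decide() returns "act"."""
    strikes = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        action, strikes = decide(ping_ok(host), tcp_ok(host), strikes)
        if action == "act":
            on_failure(host)  # hypothetical hook: reboot, takeover, or just page someone
        checks += 1
        time.sleep(interval_s)
```

Something like watch("smith", on_failure=my_hook) would then run until interrupted.  Keeping
decide() free of any I/O means the threshold logic can be checked without a live filer, and
requiring several consecutive failures keeps one dropped connection from triggering a reboot.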
Dave