Folks,
Hey again, Jay.
I've got a question about a tcp stack failure on one of the clustered 740's, that happened over the weekend.
Gary Andrade sez:
The box would respond to ping (inbound), but any outbound attempts failed. The outbound TCP stack was toast. Skip and I conferred on the situation and the box was rebooted (problem resolved).
The failover to Wesson did not occur because the inbound ping was still working, so the other server sensed Smith as working.
I'm wondering: is it possible to configure clustering (smith and wesson are a pair of 740's) to detect tcp failure? There are no cluster-level error messages on smith.
messages.0
...
Sat Dec 18 23:52:41 GMT [smith: rc]: DNS server for domain "saegis.com" not responding : Connection timed out.
Sun Dec 19 00:00:00 GMT [smith: statd]: 12:00am up 29 days, 21:18 138103986 NFS ops, 0 CIFS ops, 0 HTTP ops
And this is the reboot:
[appworx@:/smith/spool/etc]$ head messages
Sun Dec 19 00:26:45 GMT [smith: rc]: System shut down with "reboot" command.
Sun Dec 19 00:26:45 GMT [smith: cf_main]: Cluster monitor: takeover of partner disabled (local halt in progress)
The web servers did show nfs failures while smith was having its problem:
...
Dec 19 00:01:43 ww5 automountd[235]: server smith not responding
Dec 19 00:01:43 ww5 last message repeated 6 times
Dec 19 00:08:14 ww5 automountd[235]: server smith not responding
Dec 19 00:08:15 ww5 last message repeated 6 times
This server was rebooted as well, for the same reason -- stack failure.
Can you suggest some possible reasons for the filer's tcp stack failure? We did have an incident last week where both web servers needed to be rebooted for the same reason -- tcp resets from stack overflow; the tentative theory at the moment is a syn flood attack. But the netapp isn't exposed to the internet.
If it's not possible to configure the filer to fail over on failed ping, then I'll script something. My second question: what do I need to sacrifice to the gods to get complete command-line rsh access?
smith> ping ww5
ww5.saegis.com is alive
smith> Connection closed by foreign host.
[appworx@:/apps/appworxl]$ rsh smith 'sysconfig -r' | grep root
Volume spool (root)
[appworx@:/apps/appworx]$ rsh smith ping ww5
ping not found. Type '?' for a list of commands
[appworx@:/apps/appworx]$
I could set up a test such that if nfs fails but the filer can still be pinged, then [reboot/fail over]. Feels like a kludge, though... not specific enough. I'd prefer to have the filers handle this on their own.
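A minimal sketch of that test, run from a monitoring host, might look like the following. Everything in it is an assumption for illustration: the function names are made up, the NFS check is just a TCP connect to port 2049 (which only helps if the filer serves NFS over TCP), the ping flags assume a Linux-style ping, and the actual reboot/fail-over step is left out on purpose.

```python
import socket
import subprocess


def ping_ok(host, timeout_s=2):
    """True if one ICMP echo is answered. Assumes a Linux-style ping (-c, -W)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def tcp_ok(host, port=2049, timeout_s=5):
    """True if a TCP connection can be opened. Port 2049 is an assumed NFS-over-TCP port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def classify(ping_alive, service_alive):
    """Map the two probe results onto a coarse state."""
    if service_alive:
        return "ok"            # service answers, whatever ICMP says
    if ping_alive:
        return "stack-wedged"  # pingable but not serving: the case described above
    return "down"              # neither answers; normal failover should notice this


def should_act(states, threshold=3):
    """Act only after `threshold` consecutive stack-wedged results, to avoid flapping."""
    recent = states[-threshold:]
    return len(recent) == threshold and all(s == "stack-wedged" for s in recent)


# Hypothetical polling step (host name taken from this thread):
#   history.append(classify(ping_ok("smith"), tcp_ok("smith")))
#   if should_act(history): ...trigger whatever reboot/fail-over action is chosen...
```

Requiring several consecutive failures is a guess at a sane default; one missed probe on a busy network shouldn't be enough to force a takeover.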
Dave
Folks,
Hey again, Jay.
I've got a question about a tcp stack failure on one of the clustered 740's, that happened over the weekend.
Gary Andrade sez:
The box would respond to ping (inbound), but any outbound attempts failed. The outbound TCP stack was toast. Skip and I conferred on the situation and the box was rebooted (problem resolved).
Their failover detection tests the ICMP code only??!? DUH! How stupid can you be? Geez, come on guys, write some real failover detection code. All the protocol stacks and layers need to be exercised.
Bruce