Folks, I'm on a fishing expedition here.
We've had four incidents over the past ~6 months of our 6070's losing packets and becoming effectively unusable. Symptoms include *very* slow response on clients, both CIFS and NFS. The messages file on the filer shows problems connecting to all external services (domain controllers, NIS servers, etc.):
[selected lines clipped]
Wed Jun  6 21:02:23 EDT [xxxxx: nis.server.inactive:warning]: preferred NIS Server xxx.xx.xxx.x not responding
Wed Jun  6 21:02:23 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Wed Jun  6 21:15:10 EDT [xxxxx: nfsap_process:warning]: Directory service outage prevents NFS server from determining if client (xxx.xx.xxx.x has root access to path /vol/YYYYYY/YYYYYY (xid 1346843745). Client will experience delayed access during outage.
Wed Jun  6 21:15:18 EDT [xxxxx: rpc.client.error:error]: yp_match: clnt_call: RPC: Timed out
Wed Jun  6 21:16:45 EDT [xxxxx: rshd_0:error]: rshd: when reading user name to use on this machine from ZZZZZZZZZ, it didn't arrive within 60 seconds.
Wed Jun  6 21:28:54 EDT [xxxxx: auth.dc.trace.DCConnection.errorMsg:error]: AUTH: Domain Controller error: NetLogon error 0xc0000022: - Filer's security information differs from domain controller.
Thu Jun  7 00:21:08 EDT [xxxxx: nis_worker_0:warning]: Local NIS group update failed. Could not download the group file from the NIS server.
Thu Jun  7 00:21:11 EDT [xxxxx: nis.server.inactive:warning]: preferred NIS Server xxx.xx.xxx.x not responding
Thu Jun  7 00:21:11 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Thu Jun  7 00:21:24 EDT [xxxxx: nis.servers.not.available:error]: NIS server(s) not available.
Thu Jun  7 00:21:37 EDT [xxxxx: mnt_assist:warning]: Client xxx.xxx.xx.xx (xid 0) fails to resolve via gethostbyaddr_r() for root access - host_errno = 2, errno = 61
Thu Jun  7 00:22:23 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Our Nagios system trips on 100% ping loss almost immediately, and then flaps until the filer is restarted.
First time, we just rebooted (the whole company twiddling their thumbs, waiting...). Second time, we begged for diagnosis time and called it in, and ended up replacing the motherboard (suspected network port). *Third* time, we no longer suspected hardware, dumped core then rebooted; core analysis didn't find anything. FOURTH time happened last week; dumped core, sent it in, awaiting analysis but not hopeful.
This has only happened on our 6070's. Never on our 980's or R200's (running same ONTAP versions). It's happened on two different ONTAP 7.2 releases. It's happened on three different 6070's (two incidents on one of them). We're running full flowcontrol. We're running vif's. System firmware on six of them (and this latest victim) was upgraded in February 2007. We have eight 6070's in one location; two have been affected. We have six 6070's in another location and one has been affected.
The only stat that looks "weird" is that ifstat shows an unusually high number of transmit queue overflows / discards (ifstat from autosupport just before last core dump):
===== IFSTAT-A =====
-- interface e0a (102 days, 11 hours, 43 minutes, 36 seconds) --
RECEIVE
 Frames/second:    24198 | Bytes/second:     4311k | Errors/minute:        0
 Discards/minute:      0 | Total frames:      384g | Total bytes:       107t
 Total errors:         1 | Total discards:   1201k | Multi/broadcast:      0
 No buffers:       1201k | Non-primary u/c:      0 | Tag drop:             0
 Vlan tag drop:        0 | Vlan untag drop:      0 | CRC errors:           1
 Alignment errors:     0 | Runt frames:          0 | Long frames:          0
 Fragment:             0 | Jabber:               0 | Xon:                  0
 Xoff:                 0 | Ring full:            0 | Jumbo:                0
 Jumbo error:          0
TRANSMIT
 Frames/second:    29384 | Bytes/second:    13057k | Errors/minute:        0
 Discards/minute:      0 | Total frames:      470g | Total bytes:       314t
 Total errors:         0 | Total discards:    954k | Multi/broadcast:   123k
 Queue overflows:   954k | No buffers:           0 | Single collision:     0
 Multi collisions:     0 | Late collisions:      0 | Max collisions:       0
 Deferred:             0 | Xon:                  0 | Xoff:                 0
 MAC Internal:         0 | Jumbo:                0
LINK_INFO
 Current state:       up | Up to downs:          1 | Speed:            1000m
 Duplex:            full | Flowcontrol:        full
Switch side shows no extraordinary errors.
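(If anyone wants to compare counters on their own 6070's, the obvious way is to zero the per-interface stats and re-sample; syntax below is from memory for our 7.2 release, so double-check it on yours.)

ifstat -z e0a     # zero the statistics for e0a
ifstat e0a        # re-sample later; watch "Queue overflows" under TRANSMIT
ifstat -a         # same counters for every interface, vif members included

Sampling before and during an incident would at least show whether the overflows are steady-state noise or spike with the event.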
So, I'm fishing for some ideas here. Waiting for it to happen again to gather network traces is not an optimal strategy (as much as I'd like to have those traces as well). I'm sure NetApp support will do their best in analyzing the core, but as I mentioned above, I'm not hopeful they will find the culprit this second time. So, I'm looking for any anecdotes, other weird behavior, crazy ideas, or even possible ones [1], from this large group, in parallel with NetApp's efforts.
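One thing worth doing in the meantime (a rough sketch; the pktt options are from memory and the directory/size values are only examples, so verify against the ONTAP docs first) is to leave a bounded packet trace running on the vif members so a capture already exists the next time one of them wedges:

pktt start e0a -d /etc/crash -s 100m    # trace e0a to a file under /etc/crash, capped at ~100 MB
pktt start e0b -d /etc/crash -s 100m
pktt status                             # confirm both traces are running
pktt dump e0a                           # write out anything still buffered for e0a
pktt stop e0a
pktt stop e0b

The resulting .trc files should open in Ethereal, so there would at least be something to look at alongside the core.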
[1] "When you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth." --Sherlock Holmes
Until next time...
The MathWorks, Inc. 508-647-7000 x7792 3 Apple Hill Drive, Natick, MA 01760-2098 508-647-7001 FAX tmerrill@mathworks.com http://www.mathworks.com ---
Just for kicks, what are the contents of /etc/rc pertaining to the network?
What about the stanza(s) from the switch about the connected interface?
I use LACP for my vifs (FAS980/R200 and used to on a FAS6070 until it was changed to GX).
Thanks
On 6/12/07, Todd C. Merrill tmerrill@mathworks.com wrote:
[quoted original message trimmed]
I'll try to answer most of the questions posed so far...
For the filer involved in the latest incident:
what is the contents of /etc/rc pertaining to the network?
ifconfig e0a flowcontrol full
ifconfig e0b flowcontrol full
vif create single vif17 e0a e0b
vif favor e0a
ifconfig vif17 `hostname`-vif17 up netmask 255.255.255.0 broadcast xxx.xx.xxx.255 partner vif18
route add default xxx.xx.xxx.x 1
routed on
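(For the record, the post-boot sanity checks on that config are nothing exotic; roughly the following, though the output format differs a bit between releases:)

vif status vif17    # which link is active, and whether e0a is still favored
ifconfig vif17      # address, netmask, partner, and flowcontrol settings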
What about the stanza(s) from the switch about the connected interface?
interface GigabitEthernet6/7
 description xxx17[e0a]
 switchport
 switchport access vlan xxx
 switchport mode access
 no ip address
 flowcontrol receive on
 flowcontrol send on
 spanning-tree portfast
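(If specific switch-side counters would help, these are the sort of checks that can be re-run on that port; command forms from memory, adjust for the IOS rev:)

show interfaces gigabitEthernet 6/7 counters errors
show interfaces gigabitEthernet 6/7 flowcontrol
show interfaces gigabitEthernet 6/7 | include drops|pause|error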
I use LACP for my vifs (FAS980/R200 and used to on a FAS6070 until it was changed to GX).
Our network guys told me we do not use LACP.
Are the VIFs in single mode or multi mode? Maybe try going the other way unless you have bandwidth constraints. We run single here because the interfaces plug into different switches.
Single mode (see config info above). We do that because we don't need the bandwidth; we often sniff the port; and we only require the redundancy.
Wondering if the vifs have been created out of multiple interfaces on the same or across multiple switches?
Same switch, different blades.
What does the configuration look like? How many interfaces on the switch? How many switches? Vlans? Etc.... Are there any other interfaces configured on different subnets/networks? Do they become unresponsive also?
Switch interface config shown above. Switch has 4x48 interfaces. One switch involved. 40+ VLANs on the switch in total. None of the other subnets/networks experience the unresponsiveness. And, neither do the other five 6070's on this same switch on the same VLAN on the same subnet.
What kind of switches are you using, and what code rev?
Cisco 6506.
#sh version
Cisco Internetwork Operating System Software
IOS (tm) s3223_rp Software (s3223_rp-IPBASEK9-M), Version 12.2(18)SXF, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2005 by cisco Systems, Inc.
Compiled Fri 09-Sep-05 21:36 by ccai
Image text-base: 0x40101040, data-base: 0x42CC0000

ROM: System Bootstrap, Version 12.2(17r)SX3, RELEASE SOFTWARE (fc1)
BOOTLDR: s3223_rp Software (s3223_rp-IPBASEK9-M), Version 12.2(18)SXF, RELEASE SOFTWARE (fc1)
My first focus would be on the LAN switch fabric: making sure that spanning tree is functioning correctly and, if you are using VLANs, that your ISL trunks are correct. This smells like some sort of broadcast storm/loop. You're describing a situation where, all of a sudden, you get 100% ping loss and then flapping up and down on the network; that really sounds like a switching issue, as if the spanning tree is converging because something ELSE got added to the segment. Do you have Portfast set on the ports the NetApp is connected to? How many interfaces do you have?
No other devices on the same switch, on the same VLAN, on the same subnet experience a similar loss of connectivity. :-\
Network guys say the logs show no spanning tree activity during or just prior to the incident.
OR, perhaps someone is introducing a device with the same IP as the NetApp?
Also no duplicate IP entries in the logs either.
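(Next time it happens I'll ask them to capture at least the following from the 6506 while it's broken; command forms from memory, and the VLAN/IP/MAC values are placeholders:)

show spanning-tree vlan <vlan> detail          # "last change occurred" shows whether the tree moved recently
show ip arp <filer-ip>                         # watch for the MAC behind the filer's IP changing
show mac-address-table address <filer-mac>     # a MAC flapping between ports suggests a loop or a duplicate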
Thanks for the ideas...keep them comin'...
Until next time...
The MathWorks, Inc. 508-647-7000 x7792 3 Apple Hill Drive, Natick, MA 01760-2098 508-647-7001 FAX tmerrill@mathworks.com http://www.mathworks.com ---
What is the output of "show mod" on your switch?
Why no LACP? It is supposed to improve failover capability.
It is simple to set up:

# ifconfig e0a flowcontrol full     # not needed
# ifconfig e0b flowcontrol full     # not needed
vif create lacp -b mac vif17 e0a e0b
# vif favor e0a                     # not used in multi or lacp
ifconfig vif17 `hostname`-vif17 up netmask 255.255.255.0 broadcast xxx.xx.xxx.255 partner vif18
route add default xxx.xx.xxx.x 1
routed on
On the switch:

interface GigabitEthernet6/7
 description xxx17[e0a]
 switchport
 switchport access vlan xxx
 switchport mode access
 no ip address
 flowcontrol receive on
 flowcontrol send on
 spanning-tree portfast
 channel-group 99 mode active
(The "active" in the last line selects LACP; setting it to "on" is straight-up EtherChannel, i.e. what pre-7.2 code uses.)
And also on the switch:

interface Port-channel 99
 description PortChannel for vif17
 switchport
 switchport access vlan xxx
 switchport mode access
 no ip address
 flowcontrol receive on
 flowcontrol send on
 spanning-tree portfast
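Once it's up, it's worth confirming that the channel really negotiated LACP and that both members bundled; roughly the following, though exact output varies by IOS and ONTAP rev.

On the switch:

show etherchannel 99 summary    # both members should show the (P) bundled flag under Po99
show lacp neighbor              # the filer's ports should appear as LACP partners

On the filer:

vif status vif17                # should report a lacp vif with both links up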
On 6/13/07, Todd C. Merrill tmerrill@mathworks.com wrote:
[quoted reply trimmed]
Todd,
Are the VIFs in single mode or multi mode? Maybe try going the other way unless you have bandwidth constraints. We run single here because the interfaces plug into different switches.
Also, regarding your system firmware note: v1.4 came out at the end of April, and support just recommended we install it "just in case" because of an FCAL loop failure on one of our onboard ports on a 6030.
HTH,
Hadrian Baron Network Engineer
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Todd C. Merrill Sent: Tuesday, June 12, 2007 11:12 AM To: toasters@mathworks.com Subject: 6070's loosing packets, become unusable
[quoted original message trimmed]
Wondering if the vifs have been created out of multiple interfaces on the same switch or across multiple switches? I've seen some weird behavior in the past where the client's network switches had issues refreshing MAC addresses across switches, so when a failover occurred the vif would not be reachable for about 2 minutes...
What does the configuration look like? How many interfaces on the switch? How many switches? Vlans? Etc.... Are there any other interfaces configured on different subnets/networks? Do they become unresponsive also?
Hmmmm, interesting...
Best Regards, Julio C.
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Todd C. Merrill Sent: Tuesday, June 12, 2007 11:12 AM To: toasters@mathworks.com Subject: 6070's loosing packets, become unusable
[quoted original message trimmed]
Hi Todd,
What kind of switches are you using, and what code rev?
Regards, Max
[quoted original message trimmed]
Todd, which 7.2 OS revs are you running? We bled greatly over this when we first upgraded to 6070s and 7.2. The directory service outage sounds a lot like what we experienced, and it's what triggered us to go to 7.2P4. We are now in the process of rolling out 7.2.2P1, which also has all of our fixes.
We have 10 6070s at our Boston site, 10 at our Austin site and 2 in our Sunnyvale site. 7.2P4 is the rev that fixed most of our issues and we have been stable since going to this.
Here are the bugs that hurt us when we upgraded:
1. 223781
2. 213330
3. 225835
4. 225936
5. 225731
6. 195670
7. 179451
8. 184424
9. 195348
Some are worse than others. Pay close attention to 213330 and 179451.
We export by NIS netgroups, and our netgroup file is huge, which we hear made the pain much worse.
Let me know if you have any questions. C-
On Tue, Jun 12, 2007 at 02:11:44PM -0400, Todd C. Merrill wrote:
[quoted original message trimmed]
Todd:
My first focus would be on the LAN switch fabric: making sure that spanning tree is functioning correctly and, if you are using VLANs, that your ISL trunks are correct. This smells like some sort of broadcast storm/loop. You're describing a situation where, all of a sudden, you get 100% ping loss and then flapping up and down on the network; that really sounds like a switching issue, as if the spanning tree is converging because something ELSE got added to the segment. Do you have Portfast set on the ports the NetApp is connected to? How many interfaces do you have?
OR, perhaps someone is introducing a device with the same IP as the NetApp?
Just like Sherlock Holmes said....
Glenn from Voyant (formerly known as the other one)
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner- toasters@mathworks.com] On Behalf Of Todd C. Merrill Sent: Tuesday, June 12, 2007 2:12 PM To: toasters@mathworks.com Subject: 6070's loosing packets, become unusable
[quoted original message trimmed]