just for kicks,
what are the contents of /etc/rc pertaining to the network?
What about the stanza(s) from the switch about the connected interface?
I use LACP for my vifs (FAS980/R200 and used to on a FAS6070 until it was changed to GX).
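For reference, here's the general shape of the stanzas I'm asking about — a hypothetical sketch only (the vif name, port names, IP address, channel-group number, and VLAN are all placeholders, not anyone's actual config):

```
# /etc/rc (Data ONTAP 7-mode) -- LACP vif across two GbE ports
vif create lacp vif0 -b ip e0a e0b
ifconfig vif0 10.0.0.10 netmask 255.255.255.0 flowcontrol full up

! Cisco IOS side -- matching LACP channel-group (sketch)
interface GigabitEthernet1/0/1
 channel-group 10 mode active
interface Port-channel10
 switchport mode access
 switchport access vlan 100
 flowcontrol receive on
```

The interesting bits are usually whether the load-balancing policy (`-b ip` vs `-b mac`) and the flowcontrol settings agree on both ends of the link.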
Thanks
Folks,
I'm on a fishing expedition here.
We've had four incidents over the past ~6 months of our 6070's
losing packets and becoming effectively unusable. Symptoms include
*very* slow response on clients, both CIFS and NFS. Messages file on
filer shows problems connecting to all external services (domain
controllers, NIS servers, etc.):
[selected lines clipped]
Wed Jun 6 21:02:23 EDT [xxxxx: nis.server.inactive:warning]: preferred NIS Server xxx.xx.xxx.x not responding
Wed Jun 6 21:02:23 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Wed Jun 6 21:15:10 EDT [xxxxx: nfsap_process:warning]: Directory service outage prevents NFS server from determining if client ( xxx.xx.xxx.x has root access to path /vol/YYYYYY/YYYYYY (xid 1346843745). Client will experience delayed access during outage.
Wed Jun 6 21:15:18 EDT [xxxxx: rpc.client.error:error]: yp_match: clnt_call: RPC: Timed out
Wed Jun 6 21:16:45 EDT [xxxxx: rshd_0:error]: rshd: when reading user name to use on this machine from ZZZZZZZZZ, it didn't arrive within 60 seconds.
Wed Jun 6 21:28:54 EDT [xxxxx: auth.dc.trace.DCConnection.errorMsg:error ]: AUTH: Domain Controller error: NetLogon error 0xc0000022: - Filer's security information differs from domain controller.
Thu Jun 7 00:21:08 EDT [xxxxx: nis_worker_0:warning]: Local NIS group update failed. Could not download the group file from the NIS server.
Thu Jun 7 00:21:11 EDT [xxxxx: nis.server.inactive:warning]: preferred NIS Server xxx.xx.xxx.x not responding
Thu Jun 7 00:21:11 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Thu Jun 7 00:21:24 EDT [xxxxx: nis.servers.not.available:error]: NIS server(s) not available.
Thu Jun 7 00:21:37 EDT [xxxxx: mnt_assist:warning]: Client xxx.xxx.xx.xx (xid 0) fails to resolve via gethostbyaddr_r() for root access - host_errno = 2, errno = 61
Thu Jun 7 00:22:23 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Our Nagios system trips on 100% ping loss almost immediately, and
then flaps until the filer is restarted.
First time, we just rebooted (the whole company twiddling
their thumbs, waiting...). Second time, we begged for diagnosis time and
called it in, and ended up replacing the motherboard (suspected network
port). *Third* time, we no longer suspected hardware, dumped core then
rebooted; core analysis didn't find anything. FOURTH time happened last
week; dumped core, sent it in, awaiting analysis but not hopeful.
This has only happened on our 6070's. Never on our 980's or R200's
(running same ONTAP versions). It's happened on two different ONTAP
7.2 releases. It's happened on three different 6070's (two incidents
on one of them). We're running full flowcontrol. We're running vifs.
System firmware on six of them (and this latest victim) was upgraded in
February 2007. We have eight 6070's in one location; two have been affected.
We have six 6070's in another location and one has been affected.
The only stat that looks "weird" is that ifstat shows an unusually
high number of transmit queue overflows / discards (ifstat from autosupport just
before last core dump):
> ===== IFSTAT-A =====
>
> -- interface e0a (102 days, 11 hours, 43 minutes, 36 seconds) --
>
> RECEIVE
> Frames/second: 24198 | Bytes/second: 4311k | Errors/minute: 0
> Discards/minute: 0 | Total frames: 384g | Total bytes: 107t
> Total errors: 1 | Total discards: 1201k | Multi/broadcast: 0
> No buffers: 1201k | Non-primary u/c: 0 | Tag drop: 0
> Vlan tag drop: 0 | Vlan untag drop: 0 | CRC errors: 1
> Alignment errors: 0 | Runt frames: 0 | Long frames: 0
> Fragment: 0 | Jabber: 0 | Xon: 0
> Xoff: 0 | Ring full: 0 | Jumbo: 0
> Jumbo error: 0
> TRANSMIT
> Frames/second: 29384 | Bytes/second: 13057k | Errors/minute: 0
> Discards/minute: 0 | Total frames: 470g | Total bytes: 314t
> Total errors: 0 | Total discards: 954k | Multi/broadcast: 123k
> Queue overflows: 954k | No buffers: 0 | Single collision: 0
> Multi collisions: 0 | Late collisions: 0 | Max collisions: 0
> Deferred: 0 | Xon: 0 | Xoff: 0
> MAC Internal: 0 | Jumbo: 0
> LINK_INFO
> Current state: up | Up to downs: 1 | Speed: 1000m
> Duplex: full | Flowcontrol: full
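For scale, it's worth noting how small those discard counters are relative to total traffic. A quick back-of-the-envelope check, using the counter values from the ifstat output above and assuming ifstat's suffixes are decimal (k = 1e3, g = 1e9):

```python
# Back-of-the-envelope: discard counters as a fraction of total frames.
# Values copied from the IFSTAT-A output above; suffixes assumed decimal.

rx_no_buffers   = 1_201e3   # RECEIVE "No buffers" / "Total discards"
rx_total_frames = 384e9     # RECEIVE "Total frames"
tx_overflows    = 954e3     # TRANSMIT "Queue overflows" / "Total discards"
tx_total_frames = 470e9     # TRANSMIT "Total frames"

rx_pct = 100 * rx_no_buffers / rx_total_frames
tx_pct = 100 * tx_overflows / tx_total_frames

print(f"RX no-buffer drops: {rx_pct:.6f}% of frames")   # ~0.000313%
print(f"TX queue overflows: {tx_pct:.6f}% of frames")   # ~0.000203%
```

So averaged over 102 days of uptime the drop rate is tiny — which is my point: if these counters matter at all, it's presumably because they accumulate in short bursts around the incidents, not steadily.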
Switch side shows no extraordinary errors.
So, I'm fishing for some ideas here. Waiting for it to happen again
to gather network traces is not an optimal strategy (as much as I'd like
to have those traces as well). I'm sure NetApp support will do their
best in analyzing the core, but as I mentioned above, I'm not hopeful
they will find the culprit this second time. So, I'm looking for any
anecdotes, other weird behavior, crazy ideas, or even possible ones [1],
from this large group, in parallel with NetApp's efforts.
[1] "When you have eliminated all which is impossible, then whatever
remains, however improbable, must be the truth." --Sherlock Holmes
Until next time...
The MathWorks, Inc. 508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098 508-647-7001 FAX
tmerrill@mathworks.com http://www.mathworks.com