Folks, I'm on a fishing expedition here.
We've had four incidents over the past ~6 months of our 6070's losing packets and becoming effectively unusable. Symptoms include *very* slow response on clients, both CIFS and NFS. The messages file on the filer shows problems connecting to all external services (domain controllers, NIS servers, etc.):
[selected lines clipped]
Wed Jun 6 21:02:23 EDT [xxxxx: nis.server.inactive:warning]: preferred NIS Server xxx.xx.xxx.x not responding
Wed Jun 6 21:02:23 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Wed Jun 6 21:15:10 EDT [xxxxx: nfsap_process:warning]: Directory service outage prevents NFS server from determining if client (xxx.xx.xxx.x has root access to path /vol/YYYYYY/YYYYYY (xid 1346843745). Client will experience delayed access during outage.
Wed Jun 6 21:15:18 EDT [xxxxx: rpc.client.error:error]: yp_match: clnt_call: RPC: Timed out
Wed Jun 6 21:16:45 EDT [xxxxx: rshd_0:error]: rshd: when reading user name to use on this machine from ZZZZZZZZZ, it didn't arrive within 60 seconds.
Wed Jun 6 21:28:54 EDT [xxxxx: auth.dc.trace.DCConnection.errorMsg:error]: AUTH: Domain Controller error: NetLogon error 0xc0000022: - Filer's security information differs from domain controller.
Thu Jun 7 00:21:08 EDT [xxxxx: nis_worker_0:warning]: Local NIS group update failed. Could not download the group file from the NIS server.
Thu Jun 7 00:21:11 EDT [xxxxx: nis.server.inactive:warning]: preferred NIS Server xxx.xx.xxx.x not responding
Thu Jun 7 00:21:11 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Thu Jun 7 00:21:24 EDT [xxxxx: nis.servers.not.available:error]: NIS server(s) not available.
Thu Jun 7 00:21:37 EDT [xxxxx: mnt_assist:warning]: Client xxx.xxx.xx.xx (xid 0) fails to resolve via gethostbyaddr_r() for root access - host_errno = 2, errno = 61
Thu Jun 7 00:22:23 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Our Nagios system trips on 100% ping loss almost immediately, and then flaps until the filer is restarted.
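For anyone curious what that check amounts to, here's a rough Python sketch of the ping-loss probe; the hostname, packet count, and output format are placeholders, not our actual Nagios plugin.

#!/usr/bin/env python3
# Minimal sketch of the kind of ping-loss probe the monitoring box runs
# against a filer.  FILER and COUNT below are made-up placeholders.
import re
import subprocess
import sys

FILER = "filer01"      # hypothetical filer hostname
COUNT = 10             # packets per probe

def ping_loss(host, count):
    """Run the system ping and return the reported packet-loss percentage."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    m = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(m.group(1)) if m else 100.0

loss = ping_loss(FILER, COUNT)
if loss >= 100.0:
    print("CRITICAL - %.0f%% packet loss to %s" % (loss, FILER))
    sys.exit(2)        # Nagios exit code for CRITICAL
print("OK - %.0f%% packet loss to %s" % (loss, FILER))
sys.exit(0)

During these incidents that check goes straight to 100% loss and stays there (modulo flapping) until the filer is restarted.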
The first time, we just rebooted (while the whole company twiddled their thumbs, waiting...). The second time, we begged for diagnosis time, called it in, and ended up replacing the motherboard (suspected network port). The *third* time, we no longer suspected hardware, so we dumped core and then rebooted; core analysis didn't find anything. The FOURTH time happened last week; we dumped core and sent it in, and are awaiting analysis but not hopeful.
This has only happened on our 6070's, never on our 980's or R200's (which run the same ONTAP versions). It's happened on two different ONTAP 7.2 releases, and on three different 6070's (two incidents on one of them). We're running full flowcontrol. We're running vif's. System firmware on six of them (and this latest victim) was upgraded in February 2007. We have eight 6070's in one location; two have been affected. We have six 6070's in another location; one has been affected.
The only stat that looks "weird" is that ifstat shows an unusually high number of transmit queue overflows / discards (ifstat from autosupport just before last core dump):
===== IFSTAT-A =====
-- interface e0a (102 days, 11 hours, 43 minutes, 36 seconds) --
RECEIVE
 Frames/second:    24198 | Bytes/second:     4311k | Errors/minute:        0
 Discards/minute:      0 | Total frames:      384g | Total bytes:        107t
 Total errors:         1 | Total discards:   1201k | Multi/broadcast:      0
 No buffers:       1201k | Non-primary u/c:      0 | Tag drop:             0
 Vlan tag drop:        0 | Vlan untag drop:      0 | CRC errors:           1
 Alignment errors:     0 | Runt frames:          0 | Long frames:          0
 Fragment:             0 | Jabber:               0 | Xon:                  0
 Xoff:                 0 | Ring full:            0 | Jumbo:                0
 Jumbo error:          0
TRANSMIT
 Frames/second:    29384 | Bytes/second:    13057k | Errors/minute:        0
 Discards/minute:      0 | Total frames:      470g | Total bytes:        314t
 Total errors:         0 | Total discards:    954k | Multi/broadcast:    123k
 Queue overflows:   954k | No buffers:           0 | Single collision:     0
 Multi collisions:     0 | Late collisions:      0 | Max collisions:       0
 Deferred:             0 | Xon:                  0 | Xoff:                 0
 MAC Internal:         0 | Jumbo:                0
LINK_INFO
 Current state:       up | Up to downs:          1 | Speed:            1000m
 Duplex:            full | Flowcontrol:        full
Switch side shows no extraordinary errors.
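If anyone wants to compare counters on their own gear, here's a rough sketch of how we might scrape "Queue overflows" and the transmit "Total discards" out of a saved IFSTAT-A section to trend them across filers. Everything in it (file path, regex, field names) is inferred from the layout above, not a supported autosupport parser, so treat it as a starting point.

#!/usr/bin/env python3
# Rough sketch: pull per-interface "Queue overflows" / "Total discards"
# counters out of a saved IFSTAT-A section of an autosupport.
import re
import sys

def parse_ifstat(path):
    counters = {}
    iface = None
    for line in open(path):
        m = re.match(r"-- interface (\S+)", line)
        if m:
            iface = m.group(1)
            counters[iface] = {}
            continue
        if iface:
            # Each row holds up to three "Name: value" pairs separated by '|'.
            # Note: "Total discards" appears under both RECEIVE and TRANSMIT;
            # this simple dict keeps whichever appears last (TRANSMIT).
            for name, value in re.findall(r"([A-Za-z][A-Za-z/ ]*):\s*(\S+)", line):
                counters[iface][name.strip()] = value
    return counters

for iface, stats in sorted(parse_ifstat(sys.argv[1]).items()):
    print("%s: queue overflows=%s, transmit discards=%s" % (
        iface,
        stats.get("Queue overflows", "n/a"),
        stats.get("Total discards", "n/a")))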
So, I'm fishing for some ideas here. Waiting for it to happen again to gather network traces is not an optimal strategy, as much as I'd like to have those traces as well (one idea for automating that capture is sketched below the footnote). I'm sure NetApp support will do their best analyzing the core, but as I mentioned above, I'm not hopeful they'll find the culprit this second time. In the meantime, I'm looking for any anecdotes, other weird behavior, crazy ideas, or even possible ones [1], from this large group, in parallel with NetApp's efforts.
[1] "When you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth." --Sherlock Holmes
Until next time...
The MathWorks, Inc.
3 Apple Hill Drive, Natick, MA 01760-2098
508-647-7000 x7792 | 508-647-7001 FAX
tmerrill@mathworks.com
http://www.mathworks.com