Folks, I'm on a fishing expedition here.
We've had four incidents over the past ~6 months of our 6070's losing packets and becoming effectively unusable. Symptoms include *very* slow response on clients, both CIFS and NFS. The messages file on the filer shows problems connecting to all external services (domain controllers, NIS servers, etc.):
[selected lines clipped]
Wed Jun  6 21:02:23 EDT [xxxxx: nis.server.inactive:warning]: preferred NIS Server xxx.xx.xxx.x not responding
Wed Jun  6 21:02:23 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Wed Jun  6 21:15:10 EDT [xxxxx: nfsap_process:warning]: Directory service outage prevents NFS server from determining if client (xxx.xx.xxx.x has root access to path /vol/YYYYYY/YYYYYY (xid 1346843745). Client will experience delayed access during outage.
Wed Jun  6 21:15:18 EDT [xxxxx: rpc.client.error:error]: yp_match: clnt_call: RPC: Timed out
Wed Jun  6 21:16:45 EDT [xxxxx: rshd_0:error]: rshd: when reading user name to use on this machine from ZZZZZZZZZ, it didn't arrive within 60 seconds.
Wed Jun  6 21:28:54 EDT [xxxxx: auth.dc.trace.DCConnection.errorMsg:error]: AUTH: Domain Controller error: NetLogon error 0xc0000022: - Filer's security information differs from domain controller.
Thu Jun  7 00:21:08 EDT [xxxxx: nis_worker_0:warning]: Local NIS group update failed. Could not download the group file from the NIS server.
Thu Jun  7 00:21:11 EDT [xxxxx: nis.server.inactive:warning]: preferred NIS Server xxx.xx.xxx.x not responding
Thu Jun  7 00:21:11 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Thu Jun  7 00:21:24 EDT [xxxxx: nis.servers.not.available:error]: NIS server(s) not available.
Thu Jun  7 00:21:37 EDT [xxxxx: mnt_assist:warning]: Client xxx.xxx.xx.xx (xid 0) fails to resolve via gethostbyaddr_r() for root access - host_errno = 2, errno = 61
Thu Jun  7 00:22:23 EDT [xxxxx: nis.server.active:notice]: Bound to preferred NIS server xxx.xx.xxx.x
Our Nagios system trips on 100% ping loss almost immediately, and then flaps until the filer is restarted.
First time, we just rebooted (the whole company twiddling their thumbs, waiting...). Second time, we begged for diagnosis time and called it in, and ended up replacing the motherboard (suspected network port). *Third* time, we no longer suspected hardware, dumped core then rebooted; core analysis didn't find anything. FOURTH time happened last week; dumped core, sent it in, awaiting analysis but not hopeful.
This has only happened on our 6070's. Never on our 980's or R200's (running same ONTAP versions). It's happened on two different ONTAP 7.2 releases. It's happened on three different 6070's (two incidents on one of them). We're running full flowcontrol. We're running vif's. System firmware on six of them (and this latest victim) was upgraded in February 2007. We have eight 6070's in one location; two have been affected. We have six 6070's in another location and one has been affected.
The only stat that looks "weird" is that ifstat shows an unusually high number of transmit queue overflows / discards (ifstat from autosupport just before last core dump):
===== IFSTAT-A =====
-- interface e0a (102 days, 11 hours, 43 minutes, 36 seconds) --
RECEIVE
 Frames/second:    24198 | Bytes/second:     4311k | Errors/minute:        0
 Discards/minute:      0 | Total frames:      384g | Total bytes:       107t
 Total errors:         1 | Total discards:   1201k | Multi/broadcast:      0
 No buffers:       1201k | Non-primary u/c:      0 | Tag drop:             0
 Vlan tag drop:        0 | Vlan untag drop:      0 | CRC errors:           1
 Alignment errors:     0 | Runt frames:          0 | Long frames:          0
 Fragment:             0 | Jabber:               0 | Xon:                  0
 Xoff:                 0 | Ring full:            0 | Jumbo:                0
 Jumbo error:          0
TRANSMIT
 Frames/second:    29384 | Bytes/second:    13057k | Errors/minute:        0
 Discards/minute:      0 | Total frames:      470g | Total bytes:       314t
 Total errors:         0 | Total discards:    954k | Multi/broadcast:   123k
 Queue overflows:   954k | No buffers:           0 | Single collision:     0
 Multi collisions:     0 | Late collisions:      0 | Max collisions:       0
 Deferred:             0 | Xon:                  0 | Xoff:                 0
 MAC Internal:         0 | Jumbo:                0
LINK_INFO
 Current state:       up | Up to downs:          1 | Speed:            1000m
 Duplex:            full | Flowcontrol:        full
Switch side shows no extraordinary errors.
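(If anyone wants to compare counters on their own 6070's, the obvious way is to zero the per-interface stats and re-sample; syntax below is from memory for our 7.2 release, so double-check it on yours.)

ifstat -z e0a     # zero the statistics for e0a
ifstat e0a        # re-sample later; watch "Queue overflows" under TRANSMIT
ifstat -a         # same counters for every interface, vif members included

Sampling before and during an incident would at least show whether the overflows are steady-state noise or spike with the event.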
So, I'm fishing for some ideas here. Waiting for it to happen again to gather network traces is not an optimal strategy (as much as I'd like to have those traces as well). I'm sure NetApp support will do their best in analyzing the core, but as I mentioned above, I'm not hopeful they will find the culprit this second time. So, I'm looking for any anecdotes, other weird behavior, crazy ideas, or even possible ones [1], from this large group, in parallel with NetApp's efforts.
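One thing worth doing in the meantime (a rough sketch; the pktt options are from memory and the directory/size values are only examples, so verify against the ONTAP docs first) is to leave a bounded packet trace running on the vif members so a capture already exists the next time one of them wedges:

pktt start e0a -d /etc/crash -s 100m    # trace e0a to a file under /etc/crash, capped at ~100 MB
pktt start e0b -d /etc/crash -s 100m
pktt status                             # confirm both traces are running
pktt dump e0a                           # write out anything still buffered for e0a
pktt stop e0a
pktt stop e0b

The resulting .trc files should open in Ethereal, so there would at least be something to look at alongside the core.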
[1] "When you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth." --Sherlock Holmes
Until next time...
The MathWorks, Inc. 508-647-7000 x7792 3 Apple Hill Drive, Natick, MA 01760-2098 508-647-7001 FAX tmerrill@mathworks.com http://www.mathworks.com ---
Just for kicks, what are the contents of /etc/rc pertaining to the network?
What about the stanza(s) from the switch about the connected interface?
I use LACP for my vifs (FAS980/R200 and used to on a FAS6070 until it was changed to GX).
Thanks
On 6/12/07, Todd C. Merrill tmerrill@mathworks.com wrote:
[quoted original message trimmed]
I'll try to answer most of the questions posed so far...
For the filer involved in the latest incident:
what is the contents of /etc/rc pertaining to the network?
ifconfig e0a flowcontrol full
ifconfig e0b flowcontrol full
vif create single vif17 e0a e0b
vif favor e0a
ifconfig vif17 `hostname`-vif17 up netmask 255.255.255.0 broadcast xxx.xx.xxx.255 partner vif18
route add default xxx.xx.xxx.x 1
routed on
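(For the record, the post-boot sanity checks on that config are nothing exotic; roughly the following, though the output format differs a bit between releases:)

vif status vif17    # which link is active, and whether e0a is still favored
ifconfig vif17      # address, netmask, partner, and flowcontrol settings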
What about the stanza(s) from the switch about the connected interface?
interface GigabitEthernet6/7
 description xxx17[e0a]
 switchport
 switchport access vlan xxx
 switchport mode access
 no ip address
 flowcontrol receive on
 flowcontrol send on
 spanning-tree portfast
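(If specific switch-side counters would help, these are the sort of checks that can be re-run on that port; command forms from memory, adjust for the IOS rev:)

show interfaces gigabitEthernet 6/7 counters errors
show interfaces gigabitEthernet 6/7 flowcontrol
show interfaces gigabitEthernet 6/7 | include drops|pause|error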
I use LACP for my vifs (FAS980/R200 and used to on a FAS6070 until it was changed to GX).
Our network guys told me we do not use LACP.
Are the VIFs in single mode or multi mode? Maybe try going the other way unless you have bandwidth constraints. We run single here because the interfaces plug into different switches.
Single mode (see config info above). We do that because we don't need the bandwidth; we often sniff the port; and we only require the redundancy.
Wondering if the vifs have been created out of multiple interfaces on the same or across multiple switches?
Same switch, different blades.
What does the configuration look like? How many interfaces on the switch? How many switches? Vlans? Etc.... Are there any other interfaces configured on different subnets/networks? Do they become unresponsive also?
Switch interface config shown above. Switch has 4x48 interfaces. One switch involved. 40+ VLANs on the switch in total. None of the other subnets/networks experience the unresponsiveness. And, neither do the other five 6070's on this same switch on the same VLAN on the same subnet.
What kind of switches are you using, and what code rev?
Cisco 6506.
#sh version
Cisco Internetwork Operating System Software
IOS (tm) s3223_rp Software (s3223_rp-IPBASEK9-M), Version 12.2(18)SXF, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2005 by cisco Systems, Inc.
Compiled Fri 09-Sep-05 21:36 by ccai
Image text-base: 0x40101040, data-base: 0x42CC0000

ROM: System Bootstrap, Version 12.2(17r)SX3, RELEASE SOFTWARE (fc1)
BOOTLDR: s3223_rp Software (s3223_rp-IPBASEK9-M), Version 12.2(18)SXF, RELEASE SOFTWARE (fc1)
My first focus would be on the LAN switch fabric: making sure that spanning tree is functioning correctly and, if you are using VLANs, that your ISL trunks are correct. This smells like some sort of broadcast storm/loop. You're describing a situation where, all of a sudden, you get 100% ping loss and then flapping up and down on the network; that really sounds like a switching issue, as if the spanning tree is converging because something ELSE got added to the segment. Do you have Portfast set on the ports the NetApp is connected to? How many interfaces do you have?
No other devices on the same switch, on the same VLAN, on the same subnet experience a similar loss of connectivity. :-\
Network guys say the logs show no spanning tree activity during or just prior to the incident.
OR, perhaps someone is introducing a device with the same IP as the NetApp?
Also no duplicate IP entries in the logs either.
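(Next time it happens I'll ask them to capture at least the following from the 6506 while it's broken; command forms from memory, and the VLAN/IP/MAC values are placeholders:)

show spanning-tree vlan <vlan> detail          # "last change occurred" shows whether the tree moved recently
show ip arp <filer-ip>                         # watch for the MAC behind the filer's IP changing
show mac-address-table address <filer-mac>     # a MAC flapping between ports suggests a loop or a duplicate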
Thanks for the ideas...keep them comin'...
Until next time...
The MathWorks, Inc. 508-647-7000 x7792 3 Apple Hill Drive, Natick, MA 01760-2098 508-647-7001 FAX tmerrill@mathworks.com http://www.mathworks.com ---
What is the output of "show mod" on your switch?
Why no LACP? It is supposed to improve failover capability.
It is simple to set up:

# ifconfig e0a flowcontrol full     # not needed
# ifconfig e0b flowcontrol full     # not needed
vif create lacp -b mac vif17 e0a e0b
# vif favor e0a                     # not used in multi or lacp
ifconfig vif17 `hostname`-vif17 up netmask 255.255.255.0 broadcast xxx.xx.xxx.255 partner vif18
route add default xxx.xx.xxx.x 1
routed on
On the switch:

interface GigabitEthernet6/7
 description xxx17[e0a]
 switchport
 switchport access vlan xxx
 switchport mode access
 no ip address
 flowcontrol receive on
 flowcontrol send on
 spanning-tree portfast
 channel-group 99 mode active
(The "active" in the last line selects LACP; setting it to "on" is straight-up EtherChannel, i.e. what pre-7.2 code uses.)
And also on the switch:

interface Port-channel 99
 description PortChannel for vif17
 switchport
 switchport access vlan xxx
 switchport mode access
 no ip address
 flowcontrol receive on
 flowcontrol send on
 spanning-tree portfast
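Once it's up, it's worth confirming that the channel really negotiated LACP and that both members bundled; roughly the following, though exact output varies by IOS and ONTAP rev.

On the switch:

show etherchannel 99 summary    # both members should show the (P) bundled flag under Po99
show lacp neighbor              # the filer's ports should appear as LACP partners

On the filer:

vif status vif17                # should report a lacp vif with both links up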
On 6/13/07, Todd C. Merrill tmerrill@mathworks.com wrote:
[quoted reply trimmed]
Todd,
Are the VIFs in single mode or multi mode? Maybe try going the other way unless you have bandwidth constraints. We run single here because the interfaces plug into different switches.
Also, regarding your system firmware note: v1.4 came out at the end of April, and support just recommended we install it "just in case" because of an FCAL loop failure on one of our onboard ports on a 6030.
HTH,
Hadrian Baron Network Engineer
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Todd C. Merrill Sent: Tuesday, June 12, 2007 11:12 AM To: toasters@mathworks.com Subject: 6070's loosing packets, become unusable
[quoted original message trimmed]
Wondering if the vifs have been created out of multiple interfaces on the same switch or across multiple switches? I've seen some weird behavior in the past where the client's network switches had issues refreshing MAC addresses across switches, so when a failover occurred the vif would not be reachable for about 2 minutes...
What does the configuration look like? How many interfaces on the switch? How many switches? Vlans? Etc.... Are there any other interfaces configured on different subnets/networks? Do they become unresponsive also?
Hmmmm, interesting...
Best Regards, Julio C.
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Todd C. Merrill Sent: Tuesday, June 12, 2007 11:12 AM To: toasters@mathworks.com Subject: 6070's loosing packets, become unusable
[quoted original message trimmed]
Hi Todd,
What kind of switches are you using, and what code rev?
Regards, Max
[quoted original message trimmed]
Todd, which 7.2 OS revs are you running? We bled greatly over this when we first upgraded to 6070s and 7.2. The directory service outage sounds a lot like what we experienced, and it's what triggered us to go to 7.2P4. We are now in the process of rolling out 7.2.2P1, which also has all of our fixes.
We have 10 6070s at our Boston site, 10 at our Austin site and 2 in our Sunnyvale site. 7.2P4 is the rev that fixed most of our issues and we have been stable since going to this.
Here are the bugs that hurt us when we upgraded:
1. 223781
2. 213330
3. 225835
4. 225936
5. 225731
6. 195670
7. 179451
8. 184424
9. 195348
Some are worse than others. Pay close attention to 213330 and 179451.
We export by NIS netgroups, and our netgroup file is huge, which we hear made the pain much worse.
Let me know if you have any questions. C-
On Tue, Jun 12, 2007 at 02:11:44PM -0400, Todd C. Merrill wrote:
[quoted original message trimmed]
Todd:
My first focus would be on the LAN switch fabric: making sure that spanning tree is functioning correctly and, if you are using VLANs, that your ISL trunks are correct. This smells like some sort of broadcast storm/loop. You're describing a situation where, all of a sudden, you get 100% ping loss and then flapping up and down on the network; that really sounds like a switching issue, as if the spanning tree is converging because something ELSE got added to the segment. Do you have Portfast set on the ports the NetApp is connected to? How many interfaces do you have?
OR, perhaps someone is introducing a device with the same IP as the NetApp?
Just like Sherlock Holmes said....
Glenn from Voyant (formerly known as the other one)
-----Original Message----- From: owner-toasters@mathworks.com [mailto:owner- toasters@mathworks.com] On Behalf Of Todd C. Merrill Sent: Tuesday, June 12, 2007 2:12 PM To: toasters@mathworks.com Subject: 6070's loosing packets, become unusable
[quoted original message trimmed]