Hello fellow NetApp admins. Over the weekend, I hit a really quite "interesting" sort of problem. One of those weekends that ... no one really wants to have. It involved a firmware upgrade on a switch going catastrophically wrong, causing chaos for several hours.
And off the back of that, the LACP group for the primary 'data' interface on one of our two CDOT nodes has ... seemingly died.
I say "seemingly" because:
- Snooping the interfaces sees packets going in and out (mostly ARP).
- But the switch-side "snoop" doesn't see the ARP replies, and thus the switch never 'learns' the MAC and doesn't forward traffic to it.
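To make that symptom concrete, here is a minimal Python sketch of the comparison: take a summary of the ARP frames captured on the filer port and on the switch port, and list the replies that left the filer but never showed up at the switch. The `ArpFrame` layout, the `missing_replies` name and the sample addresses are all made up for illustration; real captures would need parsing into this shape first.

```python
from typing import NamedTuple


class ArpFrame(NamedTuple):
    op: str         # "request" or "reply"
    sender_ip: str
    target_ip: str


def missing_replies(host_side, switch_side):
    """Return ARP replies seen leaving the host that the switch never saw."""
    seen_on_switch = {f for f in switch_side if f.op == "reply"}
    return [f for f in host_side if f.op == "reply" and f not in seen_on_switch]


# Hypothetical capture summaries (addresses are placeholders).
host_capture = [
    ArpFrame("request", "10.0.0.1", "10.0.0.50"),
    ArpFrame("reply", "10.0.0.50", "10.0.0.1"),
]
switch_capture = [
    ArpFrame("request", "10.0.0.1", "10.0.0.50"),
]

for frame in missing_replies(host_capture, switch_capture):
    print(f"reply from {frame.sender_ip} to {frame.target_ip} never reached the switch")
```

If the list is non-empty on every port of the group, the replies are being lost between the host's capture point and the wire, which points at the host side rather than at any one switch.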
This happens on both ports of the LACP group, and even moving it to another switch entirely hasn't helped. Manually offlining the ports doesn't help (except that if I offline both, the LIF migrates automatically).
But switching the LIFs over onto the other head has fixed it, for now (although obviously we're failed over, and have reduced resilience).
Has anyone run into anything similar, or can anyone give me some insight into what could explain this perplexing behaviour? The 'source' of the problem was, probably, some serious network strangeness: loops, VLANs going up and down, all sorts of chaos.
I haven't (yet) rebooted the failed node, as the vservers are running quite merrily.
I'm wondering if there's some sort of DoS/ARP flood protection that might be tripping us up.
Thanks, Ed.
I'd take a close look at spanning tree and VLAN IDs.
Spanning tree is just generally evil and likes to cause totally unexpected problems out of spite. I barely understand STP, I've just been bitten repeatedly.
VLAN IDs have tripped me up a lot. Traffic will seem to be flowing, but because either a VLAN ID is set wrong or a default VLAN ID is incorrect, the packets go nowhere.
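As a rough illustration of that kind of mismatch, the Python sketch below compares the VLANs a host port tags against what a switch trunk is set to carry, and checks that both ends agree on where untagged traffic lands. The function name, its parameters and the VLAN numbers are hypothetical; the real values would have to be read off the filer and switch configs by hand.

```python
def vlan_problems(host_tagged, host_untagged_vlan, trunk_allowed, trunk_native):
    """Return human-readable VLAN mismatches between a host port and a trunk."""
    problems = []
    # Tagged frames for a VLAN the trunk does not carry are silently dropped.
    for vid in sorted(set(host_tagged) - set(trunk_allowed)):
        problems.append(f"VLAN {vid} is tagged on the host but not allowed on the trunk")
    # Untagged frames land in the trunk's native (default) VLAN.
    if host_untagged_vlan != trunk_native:
        problems.append(
            f"untagged traffic is expected in VLAN {host_untagged_vlan} "
            f"but the trunk's native VLAN is {trunk_native}"
        )
    return problems


# Placeholder values only.
for line in vlan_problems({100, 200}, 1, {100, 300}, 99):
    print(line)
```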
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Edward Rolison
Sent: Monday, November 07, 2016 12:42 PM
To: toasters@teaparty.net
Subject: Troubleshooting 'unreachable' on a CDOT port
On Mon, Nov 7, 2016 at 6:42 AM, Edward Rolison ed.rolison@gmail.com wrote:
Ed.
Ok...gotta say it....
Have you tried physically unlinking and relinking the network connection? Occasionally, I have seen this fix weird problems.
Also, have you checked the port settings on the switch to make sure they line up as expected?
Do you have portfast enabled for the LACP ports (spanning-tree portfast trunk)?
--tmac
Tim McCarthy, Principal Consultant
Proud Member of the #NetAppATeam https://twitter.com/NetAppATeam
I Blog at TMACsRack https://tmacsrack.wordpress.com/
Thank you for the replies - we've got to a point where we _think_ we're "just" tickling a known issue: *987243*
In certain instances, the Ethernet interface on the UTA2 X1143-R6 adapter and onboard ports might stop sending packets due to lack of transmission resources.
So now we're just lining up for a reboot at a suitable outage window, and a code update later if that's done the trick.
In the interim though, does anyone know of a good workaround for rerouting my intercluster (replication) traffic? I can't fail that interface over to another node, and as the interface is "up" but not "working", my replication jobs have failed.
You should be able to fail it over to another port on the same node.
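A minimal sketch of that idea in Python, under some loud assumptions: the port inventory, node name, LIF name and vserver name below are placeholders, and the printed `network interface migrate` line should be checked against the man page for the ONTAP release in use before anyone runs it. The candidate port also has to be one the peer cluster can actually reach.

```python
def pick_failover_port(ports, node, bad_ports):
    """Pick the first port on `node` with link up that is not in `bad_ports`."""
    for p in ports:
        if p["node"] == node and p["link"] == "up" and p["port"] not in bad_ports:
            return p["port"]
    return None


def migrate_command(vserver, lif, node, port):
    """Build the clustershell command as text; nothing is executed here."""
    return (
        f"network interface migrate -vserver {vserver} -lif {lif} "
        f"-destination-node {node} -destination-port {port}"
    )


# Hypothetical inventory: a0a is the broken LACP group, e0c has no link.
ports = [
    {"node": "node-01", "port": "a0a", "link": "up"},
    {"node": "node-01", "port": "e0c", "link": "down"},
    {"node": "node-01", "port": "e0d", "link": "up"},
    {"node": "node-02", "port": "e0d", "link": "up"},
]

target = pick_failover_port(ports, "node-01", bad_ports={"a0a"})
if target is not None:
    print(migrate_command("cluster1", "ic_lif1", "node-01", target))
```

Since the known issue is described as affecting the adapter's Ethernet ports, a port on a different adapter is probably the safer pick where one exists.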
Sent from iPhone
On 8 Nov 2016, at 13:43, Edward Rolison <ed.rolison@gmail.com> wrote:
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters