I'd take a close look at spanning tree and VLAN IDs.
Spanning tree is just generally evil and likes to cause totally unexpected problems out of spite. I barely understand STP; I've just been bitten by it repeatedly.
VLAN IDs have tripped me up a lot. Traffic will seem to be flowing, but because either a VLAN ID is set wrong or the default VLAN ID is incorrect, the packets go nowhere.
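For what it's worth, one way to rule a VLAN mismatch in or out is to look at the 802.1Q tag on frames from a packet capture. Below is a minimal Python sketch, assuming you can get raw Ethernet frames out of your capture tool as bytes. The function names are my own and not part of any NetApp or switch tooling.

```python
import struct
from typing import Optional

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q VLAN tag


def vlan_id(frame: bytes) -> Optional[int]:
    """Return the 802.1Q VLAN ID of a raw Ethernet frame, or None if untagged."""
    if len(frame) < 14:
        raise ValueError("frame too short for an Ethernet header")
    # Bytes 12-13 hold the EtherType, or the TPID when a VLAN tag is present.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != TPID_8021Q:
        return None
    if len(frame) < 18:
        raise ValueError("truncated 802.1Q tag")
    # The 16-bit TCI follows the TPID; its low 12 bits are the VLAN ID.
    (tci,) = struct.unpack("!H", frame[14:16])
    return tci & 0x0FFF


def vlan_matches(frame: bytes, expected: Optional[int]) -> bool:
    """True if the frame carries the expected VLAN ID (None means untagged)."""
    return vlan_id(frame) == expected
```

An untagged frame returns None here, which on the switch side means it lands in whatever the port's native/default VLAN is, so that is the value to compare against.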
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Edward Rolison
Sent: Monday, November 07, 2016 12:42 PM
To: toasters@teaparty.net
Subject: Troubleshooting 'unreachable' on a CDOT port
Hello fellow NetApp admins. Over the weekend, I hit a really quite "interesting" sort of a problem. One of those weekends that ... no one really wants to have. It involved a firmware upgrade on a switch going catastrophically wrong, causing chaos for several hours.
And off the back of that, on one of our two CDOT nodes, the primary 'data' interface LACP group has ... seemingly died.
I say "seemingly" because:
- Snooping the interfaces sees packets going in and out. (Mostly "arp").
- But the switch side "snoop" doesn't see the ARP replies. And thus never 'learns' the mac, and doesn't route traffic to it.
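The comparison between the two captures can be made concrete by counting ARP requests and replies on each side. This is a rough Python sketch, assuming each capture is available as a list of raw Ethernet frames (bytes). The helper names are hypothetical, and it handles at most one 802.1Q tag.

```python
import struct
from collections import Counter
from typing import Iterable, Optional

ETHERTYPE_ARP = 0x0806
ETHERTYPE_VLAN = 0x8100
ARP_NAMES = {1: "request", 2: "reply"}


def arp_opcode(frame: bytes) -> Optional[int]:
    """Return the ARP opcode of a frame, or None if it is not a complete ARP frame."""
    offset = 12  # the EtherType sits after the two 6-byte MAC addresses
    if len(frame) < offset + 2:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, offset)
    if ethertype == ETHERTYPE_VLAN:
        offset += 4  # skip the 4-byte 802.1Q tag
        if len(frame) < offset + 2:
            return None
        (ethertype,) = struct.unpack_from("!H", frame, offset)
    if ethertype != ETHERTYPE_ARP:
        return None
    opcode_at = offset + 2 + 6  # the opcode is bytes 6-7 of the ARP payload
    if len(frame) < opcode_at + 2:
        return None
    (opcode,) = struct.unpack_from("!H", frame, opcode_at)
    return opcode


def arp_summary(frames: Iterable[bytes]) -> Counter:
    """Count ARP requests and replies in a capture."""
    counts: Counter = Counter()
    for frame in frames:
        opcode = arp_opcode(frame)
        if opcode is not None:
            counts[ARP_NAMES.get(opcode, f"opcode {opcode}")] += 1
    return counts
```

If the filer-side summary shows replies going out and the switch-side summary shows none, the replies are being lost somewhere between the two capture points.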
This happens on both ports of the LACP group, and even moving it to another switch entirely hasn't helped. Manually offlining the ports doesn't help (except that if I offline both, the LIF migrates automatically).
But switching the LIFs over onto the other head has fixed it, for now (although obviously we're failed over and have reduced resilience).
Has anyone run into anything similar? Or can give me some insight as to what could explain this perplexing behaviour? The 'source' of the problem was - probably - some serious network strangeness. Loops, vlans going up and down, all sorts of chaos.
I haven't (yet) rebooted the failed node, as the vservers are running quite merrily.
I'm wondering if there's some sort of DoS/ARP flood protection that might be tripping us up.
Thanks, Ed.