Hello fellow NetApp admins.
Over the weekend, I hit a rather "interesting" problem. It was one of
those weekends that no one really wants to have.
It involved a firmware upgrade on a switch going catastrophically wrong,
causing chaos for several hours.
Off the back of that, the primary 'data' interface LACP group on one of
our two CDOT nodes has seemingly died.
I say "seemingly" because:
- Snooping the interfaces on the node shows packets going in and out
(mostly ARP).
- But a snoop on the switch side doesn't see the ARP replies, so the switch
never learns the MAC and doesn't forward traffic to it.
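To make that comparison concrete, here is a rough Python sketch of checking which ARP replies show up in a node-side capture but not in a switch-side one. It assumes both captures have been exported as tcpdump-style text lines; the sample lines, addresses and function names are made up for illustration and are not from the actual incident.

```python
import re

# Matches tcpdump-style text such as:
#   "ARP, Reply 192.0.2.5 is-at 02:00:00:00:00:01, length 28"
REPLY_RE = re.compile(r"ARP, Reply (\S+) is-at ([0-9A-Fa-f:]{17})")


def arp_replies(lines):
    """Return the set of (ip, mac) pairs announced by ARP replies."""
    found = set()
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            found.add((match.group(1), match.group(2).lower()))
    return found


def missing_on_switch(node_lines, switch_lines):
    """Replies seen leaving the node that never appear on the switch side."""
    return sorted(arp_replies(node_lines) - arp_replies(switch_lines))


# Hypothetical sample data (documentation IP range, locally administered MAC).
node_capture = [
    "12:00:01.000100 ARP, Request who-has 192.0.2.5 tell 192.0.2.1, length 46",
    "12:00:01.000200 ARP, Reply 192.0.2.5 is-at 02:00:00:00:00:01, length 28",
]
switch_capture = [
    "12:00:01.000090 ARP, Request who-has 192.0.2.5 tell 192.0.2.1, length 46",
]

for ip, mac in missing_on_switch(node_capture, switch_capture):
    print(f"reply for {ip} ({mac}) left the node but was not seen on the switch")
```

A non-empty result would match the symptom described: the node answers, but the answer never arrives at the switch.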
This happens on both ports of the LACP group, and even moving it to another
switch entirely hasn't helped. Manually offlining the ports doesn't help
either (except that if I offline both, the LIF migrates automatically).
Switching the LIFs over onto the other head has fixed it for now
(although obviously we're failed over, and have reduced resilience).
Has anyone run into anything similar, or can anyone give me some insight
into what could explain this perplexing behaviour?
The 'source' of the problem was probably some serious network
strangeness: loops, VLANs going up and down, all sorts of chaos.
I haven't (yet) rebooted the failed node, as the vservers are running quite
merrily.
I'm wondering if there's some sort of DoS/ARP flood protection that might
be tripping us up.
Thanks,
Ed.