Hey all,
I have two ancient 8.0.5 cDOT clusters running several 6080s with pairs of
Cisco 5010s for 10gig cluster net, 2 x 10gig cluster ports per filer (can't
upgrade them, no support, long story..)
Anyway, just recently one of these clusters started periodically spewing
some partial and full packet loss errors on the cluster net - multiple
filers, multiple ports, and not local to a specific cluster switch. I tried
powering off each cluster switch in turn, and I get the same errors whether
it's one, the other, or both switches in play.
I notice that flow control is enabled on the ports filer-side, but not on
the cluster switches. I never really paid any attention to that, as they
came from NetApp in this config, and have been working fine to date.
However, now that I've been doing some digging, I see a lot of mentions
that flow control should be disabled completely on 10 gigabit gear with
NetApp.
Since it's enabled on the filers but not on the cluster switches, it's
effectively off anyway, right? Or am I missing something? The switches see
plenty of pause frames coming in from the filers, so it would appear the
filers want things to slow down - and the switches, with flow control off,
are presumably just ignoring those pauses.
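For reference, this is how I've been checking both sides - nodeshell ifconfig on the filers and the NX-OS flowcontrol display on the 5010s. Exact syntax/availability on 8.0.5 and on your NX-OS release is an assumption on my part, so treat this as a sketch:

```
# Filer side (nodeshell): current flow control shows in the ifconfig output
filer> ifconfig e1a

# To turn it off on a port (assuming 8.0.5 nodeshell accepts this form):
filer> ifconfig e1a flowcontrol none

# Switch side (NX-OS on the Nexus 5010): admin/oper flow control state
# plus RxPause/TxPause counters per interface
switch# show interface ethernet 1/1 flowcontrol
```

The RxPause column on the switch is where I'm seeing the pause frames from the filers pile up.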
I'm wondering if I'm hitting some threshold on the filers that's causing
this periodic packet loss. It's not associated with a specific port, so it
doesn't appear to be a single optic burning out, and it's present regardless
of which switch (or both) is in use.
stats periodic only shows a GB or so passing over the net between these 8 x
6080s, and ops cluster wide are not that high at all, so I'm kinda
stumped. I've dealt with plenty of bad optics in the past, and we usually
run out of steam at the disk or head level, so these cluster net issues
are new to me.
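In case it helps anyone poking at the same thing, the per-port error and pause counters I've been watching come from nodeshell ifstat (counter names vary a bit by NIC, so the comments below are illustrative rather than exact):

```
# Nodeshell: dump counters for a cluster port
filer> ifstat e1a
# RECEIVE section: CRC errors, runts, discards, no-buffer drops
# TRANSMIT section: pause/Xoff frames sent when the filer wants
#                   the far end to back off

# Zero the counters, then re-check right after the next loss event:
filer> ifstat -z e1a
```

Zeroing before an event makes it a lot easier to tell which counters actually move when the loss errors fire.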
My thought is to go ahead and try enabling flow control on the switches,
but that seems to be recommended against.
Any ideas?