Hey all,
I have two ancient 8.0.5 cDOT clusters, each running several 6080s with a pair of Cisco Nexus 5010s for the 10GbE cluster network, 2 x 10GbE cluster ports per filer (can't upgrade them, no support, long story...).
Anyway, one of these clusters recently started periodically spewing partial and full packet loss errors on the cluster net: multiple filers, multiple ports, not localized to a specific cluster switch. I tried powering off each cluster switch in turn, and I get the same thing whether one, the other, or both switches are in play.
I noticed that flow control is enabled on the ports filer-side, but not on the cluster switches. I'd never really paid any attention to that, since that's the config they came from NetApp in, and it's been working fine to date.
Now that I've been doing some digging, though, I see a lot of mentions that flow control should be disabled completely on 10GbE gear with NetApp.
Since it's enabled on the filers but not on the cluster switches, it's effectively off anyway, right? Or am I missing something? The switches see plenty of pause frames coming in from the filers, so it would appear the filers want things to slow down.
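In case it helps, here's roughly how I've been checking both sides. Syntax is from memory for 8.0.x cDOT and the NX-OS on the 5010s, and the node/port names are just examples from my setup, so treat this as a sketch:

```
::> network port show -node node1 -port e1a -fields flowcontrol-admin,flowcontrol-oper

switch# show interface ethernet 1/1 flowcontrol
  (shows admin/oper send and receive state, plus the RxPause/TxPause counters
   where I'm seeing the pause frames arriving from the filers)

::> system node run -node node1 ifstat e1a
  (nodeshell per-interface counters, including pause frames sent/received)
```

The switch-side RxPause counters climbing while flowcontrol receive shows "off" is what made me wonder whether the pauses are just being ignored.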
I'm wondering if I'm hitting some threshold on the filers that's causing this periodic packet loss. It's not tied to a specific port, so it doesn't appear to be a single optic burning out, and it's present regardless of whether one or both switches are in use.
stats periodic only shows a GB or so passing over the net between these 8 x 6080s, and ops cluster-wide aren't that high at all, so I'm kinda stumped. I've dealt with plenty of bad optics in the past, and we usually run out of steam at the disk or head level, so these cluster net issues are new to me.
My thought is to go ahead and try enabling flow control on the switches to match, but that seems to be the opposite of what's recommended.
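The alternative I'm considering, since the guidance seems to be "disable everywhere," is turning flow control off on the filer cluster ports instead. If I've got the 8.0.x syntax right it would be something like this (again, port names are just my example):

```
::> network port modify -node node1 -port e1a -flowcontrol-admin none
```

My understanding (happy to be corrected) is that changing flow control bounces the link, so on cluster ports I'd do one port per node at a time and let the cluster LIFs fail over between ports rather than touching both at once.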
Any ideas?