Hi Mike,

It is a documented best practice to disable flow control on all cluster and data network ports. To disable flow control on the cluster ports without causing a disruption, the process goes roughly like this (I'm sketching it from memory against 8.x clustershell syntax, so please verify the exact commands against the KB article for your 8.0.5 release before running anything):
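
For each node, one cluster port at a time (the node, LIF, and port names below are placeholders; substitute your own):

1) List the cluster LIFs and their current/home ports:

   network interface show -role cluster

2) Migrate the cluster LIF off the port you're about to change. Cluster LIFs can only move between cluster ports on the same node, so the destination is that node's other cluster port:

   network interface migrate -vserver <node_vserver> -lif <clus_lif> -destination-node <node> -destination-port <other_cluster_port>

3) Disable flow control on the now-idle port (the link can bounce when the setting changes, which is why the LIF was moved off first):

   network port modify -node <node> -port <cluster_port> -flowcontrol-admin none

   If -flowcontrol-admin isn't exposed in 8.0.5, the nodeshell equivalent should be "ifconfig <port> flowcontrol none".

4) Revert the LIF back to its home port:

   network interface revert -vserver <node_vserver> -lif <clus_lif>

5) Repeat for the node's other cluster port, then for each remaining node. The data ports can be handled the same way using data LIF migrations.
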
To help you better, would you be able to post the exact error messages you are seeing? Also, is there a reason why you cannot upgrade to 8.2.2?

Thanks,

Dan Burkland

Sent from my mobile device, please excuse typos.

On Oct 30, 2014, at 9:40 PM, Mike Thompson <mike.thompson@gmail.com> wrote:

Hey all,

I have two ancient 8.0.5 cDOT clusters running several 6080s, with pairs of Cisco 5010s for the 10gig cluster net and 2 x 10gig cluster ports per filer (can't upgrade them, no support, long story...)

Anyway, just recently one of these clusters started periodically spewing partial and full packet loss errors on the cluster net - multiple filers, multiple ports, not local to a specific cluster switch. I tried powering off each cluster switch in turn, and I get the same thing whether it's one, the other, or both switches in play.

I notice that flow control is enabled on the ports filer-side, but not on the cluster switches.  I never really paid any attention to that, as they came from NetApp in this config, and have been working fine to date.  

However, now that I've been doing some digging, I see a lot of recommendations that flow control be disabled completely on 10-gigabit gear with NetApp.

Since it's enabled on the filers but not on the cluster switches, it's effectively off anyway, right? Or am I missing something? The switches see plenty of pause frames coming in from the filers, so it would appear the filers want things to slow down.
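
For reference, here's what I'm using to check this. On the 5010s (NX-OS), this shows the admin/oper send and receive flow control state plus RxPause/TxPause counters per port:

   show interface flowcontrol

And on the filers (assuming the flowcontrol-admin field is exposed in 8.0.5; otherwise "ifconfig <port>" from the nodeshell shows the flowcontrol setting):

   network port show -fields flowcontrol-admin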

I'm wondering if I'm hitting some threshold on the filers that's causing this periodic packet loss. It's not associated with a specific port, so it doesn't appear to be a specific optic burning out, and it's present regardless of whether I use one switch, the other, or both.

stats periodic only shows a GB or so passing over the net between these 8 x 6080s, and ops cluster-wide are not that high at all, so I'm kinda stumped. I've dealt with plenty of bad optics in the past, and we usually run out of steam at the disk or head level, so these cluster net issues are new to me.

My thought is to go ahead and try enabling flow control on the switches, but that seems to be recommended against.
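
(On the 5010s I assume that would just be "flowcontrol receive on" per interface, so the switches would actually honor the pause frames coming in from the filers - but everything I've read says not to.)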

Any ideas?


_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters