Hey all,
I have two ancient 8.0.5 cDOT clusters running several 6080s with pairs of Cisco 5010s for 10gig cluster net, 2 x 10gig cluster ports per filer (can't upgrade them, no support, long story..)
Anyway, just recently one of these clusters started periodically spewing some partial and full packet loss errors on the cluster net - multiple filers, multiple ports, not local to a specific cluster switch. Tried powering off each cluster switch, I get the same thing whether it's one, the other, or both switches in play.
I notice that flow control is enabled on the ports filer-side, but not on the cluster switches. I never really paid any attention to that, as they came from NetApp in this config, and have been working fine to date.
However now that I've been doing some digging, I see a lot of mentions that it's recommended that flow-control be disabled completely on 10 gigabit gear with NetApp.
Since it's enabled on the filers, but not on the cluster switches, it's effectively off anyway, right? Or am I missing something? The switches see plenty of pause frames coming in from the filers, so it would appear they are wanting things to slow down.
I'm wondering if I'm hitting some threshold on the filers that's causing this periodic packet loss. It's not associated with a specific port, so doesn't appear to be a specific optic burning out, and it's present regardless of using either or both switches.
stats periodic only shows a GB or so passing over the net between these 8 x 6080s, and ops cluster-wide are not that high at all, so I'm kinda stumped. I've dealt with plenty of bad optics in the past, and we usually run out of steam at the disk or head level, so these cluster net issues are new to me.
My thought is to go ahead and try enabling flow control on the switches, but that seems to be recommended against.
Any ideas?
Hi Mike,
It is a documented best practice to disable flow control on all "cluster" & "data" network ports. In order to disable flow control on cluster ports without causing a disruption, please refer to the following process:
- Log in to the SP for each applicable node
- Migrate the “clus1” LIF to the “clus2” network port
  - # net int migrate -vserver cs005-pn01 -lif clus1 -destination-node cs005-pn01 -destination-port e2a
- Disable flow control on the “clus1” network port
  - # net port modify -node cs005-pn01 -port e1a -flowcontrol-admin none
- Once the “clus1” network port is back online, revert the “clus1” LIF back to its home network port
  - # net int revert -vserver cs005-pn01 -lif clus1
- Verify the cluster is still healthy
  - # cluster ping-cluster -node cs005-pn01
- Perform the same procedure, this time replacing “clus1” with “clus2”
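The procedure above can be sketched as a small Python helper that only builds the command strings for one node so they can be reviewed before being pasted into the cluster shell. It does not connect to or run anything. The function name and its parameters are hypothetical; the node and port names are the example values from the steps above, and the port pairing for the second pass (clus2 homed on e2a, temporarily migrated to e1a) is an assumption that needs to be checked against the actual cluster.

```python
def flowcontrol_disable_commands(node, lif, home_port, other_port):
    """Build the four commands from the procedure for one cluster LIF.

    node       -- node name (also used as the vserver, as in the example)
    lif        -- cluster LIF to move out of the way, e.g. "clus1"
    home_port  -- port the LIF normally lives on (flow control is changed here)
    other_port -- the node's other cluster port, used as a temporary home
    """
    return [
        # 1. Move the LIF to the other cluster port.
        f"net int migrate -vserver {node} -lif {lif} "
        f"-destination-node {node} -destination-port {other_port}",
        # 2. Disable flow control on the now-idle home port.
        f"net port modify -node {node} -port {home_port} -flowcontrol-admin none",
        # 3. Send the LIF back to its home port once the port is back online.
        f"net int revert -vserver {node} -lif {lif}",
        # 4. Check cluster network health before touching the next port.
        f"cluster ping-cluster -node {node}",
    ]


if __name__ == "__main__":
    # Example values from the procedure; the clus2 pass is an assumed mirror image.
    for lif, home, other in (("clus1", "e1a", "e2a"), ("clus2", "e2a", "e1a")):
        for cmd in flowcontrol_disable_commands("cs005-pn01", lif, home, other):
            print(cmd)
        print()
```

Generating the list per node keeps the order of operations fixed (migrate, modify, revert, verify), which matters because the port is modified only while its LIF is hosted elsewhere.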
To help you better, could you post the exact error messages you are seeing? Also, is there a reason why you cannot upgrade to 8.2.2?
Thanks,
Dan Burkland
Sent from my mobile device, please excuse typos.
On Oct 30, 2014, at 9:40 PM, Mike Thompson mike.thompson@gmail.com wrote:
_______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
Thanks for the info Daniel!
I can't upgrade as these clusters are no longer under support, and have about 400TB of active data hanging off them (I have plenty of spares).
Here's an example set of errors - I've been getting a couple/few batches of these per day, for the last several days. Nothing in the environment has changed recently, other than activity on the cluster is picking up due to projects underway.
(this cluster started life running Ontap GX, hence the gx in the hostnames, but is now running 8.0.5 - clus1 from each node is plugged into one 5010, clus2 into the other, with an 8x twinax trunk between the switches):
Oct 30 14:06:14 bc-gx-2a vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-3b).
Oct 30 14:06:16 bc-gx-2b vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus1 (node bc-gx-2b) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:06:33 bc-gx-3b vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-3b) to cluster lif clus1 (node bc-gx-1b).
Oct 30 14:06:48 bc-gx-4a vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-4a) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:06:52 bc-gx-3a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-1b).
Oct 30 14:07:02 bc-gx-1b vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-1b) to cluster lif clus1 (node bc-gx-4a).
Oct 30 14:07:02 bc-gx-1a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus1 (node bc-gx-1a) to cluster lif clus2 (node bc-gx-1b).
Oct 30 14:07:03 bc-gx-3b vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-3b) to cluster lif clus2 (node bc-gx-1b).
Oct 30 14:07:10 bc-gx-4a vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-4a) to cluster lif clus1 (node bc-gx-3a).
Oct 30 14:07:14 bc-gx-2b vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus1 (node bc-gx-2b) to cluster lif clus1 (node bc-gx-1b).
Oct 30 14:07:17 bc-gx-1b vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-1b) to cluster lif clus2 (node bc-gx-4a).
Oct 30 14:07:25 bc-gx-4a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-4a) to cluster lif clus2 (node bc-gx-3a).
Oct 30 14:07:28 bc-gx-3a vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-3a) to cluster lif clus2 (node bc-gx-1b).
Oct 30 14:07:56 bc-gx-1b vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-1b) to cluster lif clus1 (node bc-gx-3a).
Oct 30 14:08:43 bc-gx-4b vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus1 (node bc-gx-4b) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:08:46 bc-gx-2b vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-2b) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:15:08 bc-gx-3a vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus1 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-1a).
Oct 30 14:15:16 bc-gx-1b vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus1 (node bc-gx-1b) to cluster lif clus1 (node bc-gx-3a).
Oct 30 14:15:19 bc-gx-2a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-3b).
Oct 30 14:15:47 bc-gx-3b vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-3b) to cluster lif clus2 (node bc-gx-3a).
Oct 30 14:15:48 bc-gx-2a vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-3a).
Oct 30 14:15:49 bc-gx-3a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-4a).
Oct 30 14:16:02 bc-gx-2b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-2b) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:16:03 bc-gx-2a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-2a) to cluster lif clus2 (node bc-gx-3a).
Oct 30 14:16:16 bc-gx-1a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-1a) to cluster lif clus2 (node bc-gx-1b).
Oct 30 14:16:27 bc-gx-3b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-3b) to cluster lif clus1 (node bc-gx-1b).
Oct 30 14:16:35 bc-gx-3b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-3b) to cluster lif clus2 (node bc-gx-1b).
Oct 30 14:16:36 bc-gx-4a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-4a) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:16:44 bc-gx-4a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-4a) to cluster lif clus1 (node bc-gx-3a).
Oct 30 14:16:47 bc-gx-2b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-2b) to cluster lif clus1 (node bc-gx-1b).
Oct 30 14:16:48 bc-gx-4a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-4a) to cluster lif clus2 (node bc-gx-3a).
Oct 30 14:16:50 bc-gx-1b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-1b) to cluster lif clus1 (node bc-gx-4a).
Oct 30 14:16:54 bc-gx-1b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-1b) to cluster lif clus2 (node bc-gx-4a).
Oct 30 14:17:06 bc-gx-3a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-1b).
Oct 30 14:17:18 bc-gx-3a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-3a) to cluster lif clus2 (node bc-gx-1b).
Oct 30 14:17:22 bc-gx-1b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-1b) to cluster lif clus1 (node bc-gx-3a).
Oct 30 14:17:49 bc-gx-4b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-4b) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:18:05 bc-gx-2b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-2b) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:22:52 bc-gx-2a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus1 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-3b).
Oct 30 14:23:28 bc-gx-3b vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus1 (node bc-gx-3b) to cluster lif clus1 (node bc-gx-2b).
Oct 30 14:23:28 bc-gx-3a vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus1 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-3b).
Oct 30 14:23:58 bc-gx-1a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus1 (node bc-gx-1a) to cluster lif clus2 (node bc-gx-4b).
Oct 30 14:24:15 bc-gx-2a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus1 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-1a).
Oct 30 14:24:24 bc-gx-1b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-1b) to cluster lif clus1 (node bc-gx-3a).
Oct 30 14:24:42 bc-gx-3a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-1a).
Oct 30 14:25:10 bc-gx-3a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-4a).
Oct 30 14:25:26 bc-gx-2a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-3a).
Oct 30 14:25:26 bc-gx-3b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-3b) to cluster lif clus2 (node bc-gx-3a).
Oct 30 14:25:30 bc-gx-2a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-2a) to cluster lif clus2 (node bc-gx-3a).
Oct 30 14:32:15 bc-gx-3b vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus1 (node bc-gx-3b) to cluster lif clus1 (node bc-gx-4a).
Oct 30 14:32:42 bc-gx-3a vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus1 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-3b).
Oct 30 14:32:57 bc-gx-3b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-3b) to cluster lif clus1 (node bc-gx-2b).
Oct 30 14:33:31 bc-gx-1a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-1a) to cluster lif clus2 (node bc-gx-4b).
Oct 30 14:33:33 bc-gx-2a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-1a).
Oct 30 14:41:24 bc-gx-2a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-3b).
Oct 30 14:41:24 bc-gx-3b vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-3b) to cluster lif clus1 (node bc-gx-4a).
Oct 30 14:41:48 bc-gx-3a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus1 (node bc-gx-3a) to cluster lif clus1 (node bc-gx-3b).
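One way to check whether the loss really is spread evenly, or leans toward particular nodes, is to tally the vifmgr.cluscheck events by source and destination node. The following is a minimal Python sketch, assuming the message format shown in the log above; the function name tally_loss and the SAMPLE excerpt are illustrative only, and the regular expression has been written against these messages rather than any documented EMS format.

```python
import re
from collections import Counter

# Matches one cluscheck event anywhere in the text, so it works whether the
# log has one entry per line or several entries run together on a line.
ENTRY = re.compile(
    r"vifmgr\.cluscheck\.(?P<event>dropped\w+): "
    r"\w+ packet loss when pinging from cluster lif (?P<src_lif>\S+) "
    r"\(node (?P<src_node>[^)\s]+)\) to cluster lif (?P<dst_lif>\S+) "
    r"\(node (?P<dst_node>[^)\s]+)\)"
)


def tally_loss(log_text):
    """Count droppedsome/droppedall events per source and destination node.

    droppednone entries report recovery, so they are skipped.
    Returns (by_source_node, by_destination_node) as Counters.
    """
    by_src, by_dst = Counter(), Counter()
    for m in ENTRY.finditer(log_text):
        if m["event"] == "droppednone":
            continue
        by_src[m["src_node"]] += 1
        by_dst[m["dst_node"]] += 1
    return by_src, by_dst


# A short excerpt copied from the log above, used only as sample input.
SAMPLE = """\
Oct 30 14:06:14 bc-gx-2a vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus2 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-3b).
Oct 30 14:06:16 bc-gx-2b vifmgr: vifmgr.cluscheck.droppedsome: Partial packet loss when pinging from cluster lif clus1 (node bc-gx-2b) to cluster lif clus2 (node bc-gx-3b).
Oct 30 14:06:33 bc-gx-3b vifmgr: vifmgr.cluscheck.droppedall: Total packet loss when pinging from cluster lif clus2 (node bc-gx-3b) to cluster lif clus1 (node bc-gx-1b).
Oct 30 14:15:19 bc-gx-2a vifmgr: vifmgr.cluscheck.droppednone: No packet loss when pinging from cluster lif clus2 (node bc-gx-2a) to cluster lif clus1 (node bc-gx-3b).
"""

if __name__ == "__main__":
    sources, destinations = tally_loss(SAMPLE)
    print("loss events by source node:     ", sources.most_common())
    print("loss events by destination node:", destinations.most_common())
```

If one node dominates the destination counts across a full day of messages, that points at that node's ports or its ability to answer pings under load; an even spread is more consistent with something shared, such as the flow control settings discussed in this thread.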
On Thu, Oct 30, 2014 at 9:51 PM, Daniel Burkland dburkland@dburkland.com wrote: