That is what we both said....disable flow control.
It was *previously* recommended to use flow control. Not anymore, especially on 10G networks. Disable flow control. Both directions. Everywhere.
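For the filer side (7-Mode, since vfilers are in play further down the thread), flow control is a per-interface ifconfig option. A sketch, with the interface name e0a as an example only; the same line also needs to go into /etc/rc or the setting won't survive a reboot:

```
filer> ifconfig e0a flowcontrol none
filer> ifconfig e0a    # verify - output should now show "flowcontrol none"
```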
--tmac
*Tim McCarthy* *Principal Consultant*
*Clustered ONTAP*
NCDA ID: XK7R3GEKC1QQ2LVD (Expires: 08 November 2014)
RHCE6 110-107-141 (Current until Aug 02, 2016) https://www.redhat.com/wapps/training/certification/verify.html?certNumber=110-107-141&isSearch=False&verify=Verify
NCSIE ID: C14QPHE21FR4YWD4 (Expires: 08 November 2014)
On Tue, Mar 18, 2014 at 7:56 AM, Vervloesem Wouter < wouter.vervloesem@neoria.be> wrote:
This is strange, because other documents state that flow control should not be enabled:
TR-3802 : Ethernet Storage Best Practices "For these reasons, it's not recommended to enable flow control throughout the network (including switches, data ports, intracluster ports). ... FLOW CONTROL RECOMMENDATIONS Ensure flow control is disabled on both the storage controller and the switch it is connected to."
Also, in several support cases we were told to disable flow control.
Kind regards, Wouter Vervloesem
Neoria - Uptime Group Veldkant 35D B-2550 Kontich
Tel: +32 (0)3 451 23 82 Mailto: wouter.vervloesem@neoria.be Web: http://www.neoria.be
On 18 Mar 2014, at 12:41, Sebastian Goetze <spgoetze@gmail.com> wrote:
I'll second that!
To quote TR-4068:

6.6 Flow Control Overview
Modern network equipment and protocols generally handle port congestion better than in the past. While NetApp had previously recommended flow control "send" on ESX hosts and NetApp storage controllers, the current recommendation, especially with 10GbE equipment, is to disable flow control on ESXi, NetApp FAS, and the switches in between.

With ESXi 5, flow control is not exposed in the vSphere client GUI. The ethtool command sets flow control on a per-interface basis. There are three options for flow control: autoneg, tx, and rx. tx is equivalent to "send" on other devices.

Note: With some NIC drivers, including some Intel(R) drivers, autoneg must be disabled in the same command line for tx and rx to take effect.

~ # ethtool -A vmnic2 autoneg off rx off tx off
~ # ethtool -a vmnic2
Pause parameters for vmnic2:
Autonegotiate: off
RX: off
TX: off
And the symptoms fit well: traffic being "paused" in a congested scenario - maybe just from one side - and an 'unpause' frame never arriving, thereby disconnecting the datastore for good.
HTH
Sebastian
On 3/18/2014 11:26 AM, tmac wrote:
It would be a fantastic idea to turn off all flow control in both directions. Let the TCP congestion protocol handle it.
That very well could be the issue.
--tmac
Tim McCarthy Principal Consultant
On Tue, Mar 18, 2014 at 1:24 AM, Philbert Rupkins <philbertrupkins@gmail.com> wrote:
Thanks for the response!
Yes. We are running 10g. I know flow control is enabled on the 10g
adapters on the NetApps. Not sure if it is enabled on the switches. I'll have to check with our networking team. Do you know if a pause frame would show up somewhere in the port statistics? The switches are Nexus 5K's.
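For what it's worth, pause frames do show up in NX-OS per-interface counters, so your networking team should be able to answer this quickly. Something along these lines (the interface name is just an example) should show whether a 5K port has sent or received pause frames:

```
nexus# show interface ethernet 1/1 flowcontrol
nexus# show interface ethernet 1/1 | include [Pp]ause
```

Non-zero RxPause/TxPause counters on the filer-facing or host-facing ports would line up with the flow-control theory.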
We have been examining TCP Window Sizes during packet traces but have
not found anything interesting. Of course, whenever we run a packet capture the problem never occurs so TCP Window Sizes could still be an issue.
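If a capture does catch an event, one quick thing to try (assuming Wireshark/tshark is available; these are display filters, so they can be run against an existing capture file) is filtering for zero-window and retransmission analysis flags:

```
tshark -r capture.pcap -Y "tcp.analysis.zero_window || tcp.analysis.retransmission"
```

Note that 802.3x pause frames themselves are MAC control frames and usually won't appear in a host-side capture, so the switch counters are the better place to look for those.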
On Tue, Mar 18, 2014 at 12:04 AM, Wilkinson, Brent <Brent.Wilkinson@netapp.com> wrote:
Are you running 10g? If so what are the flow control settings end to
end?
Sent from mobile device.
On Mar 17, 2014, at 10:55 PM, "Philbert Rupkins" <philbertrupkins@gmail.com> wrote:
I'll also mention that I received a response from a gentleman at NetApp who pointed out the following KB article, which recommends reducing the NFS queue depth.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd...
We noticed this KB article but have yet to try it. We are
considering other options at the moment because the article says this issue is fixed in the version of ONTAP (8.1.2P4) we are running. However, if nothing else pans out, we will give it a shot.
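For reference, if you do end up trying it, that workaround is just an advanced setting applied per ESXi host (the value 64 below is an assumption on my part; use whatever the article actually specifies):

```
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64
esxcli system settings advanced list -o /NFS/MaxQueueDepth
```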
Another note - this is also a highly shared environment in which we serve FCP, CIFS, and NFS clients from the same filers (and vfilers) that serve the NFS datastores. We have yet to find evidence that load from the other clients on the same array contributes to the problem, but it is on the radar.
Also worth noting, we are running VSC 4.2.1. It reports all of the
ESX hosts to be in compliance with the recommended settings.
On Mon, Mar 17, 2014 at 8:30 PM, Philbert Rupkins <philbertrupkins@gmail.com> wrote:
Hello Toasters,
Anybody have any issues with seemingly random ESXi 5.5 NFS datastore
disconnects during heavy load?
Our Environment:
ESXi 5.5, FAS3240, ONTAP 8.1.2P4
It doesn't happen all the time. Only during heavy load but even then
there is no guarantee that it will happen. We have yet to find a consistent trigger.
Datastores are mounted via shortname. We are planning to mount via IP
address to rule out any name resolution issues but that will take some time. DNS is generally solid so we are doubtful DNS has anything to do with it but we should align ourselves with best practices.
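A quick way to see how each datastore is currently mounted (name vs. IP) on a given host:

```
esxcli storage nfs list
```

The Host column shows whatever was used at mount time, which makes it easy to confirm the remount took.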
We serve all of our NFS through vfilers. Some of our vfilers host
5 NFS datastores from a single IP address. I mention this because I have come across documentation recommending a 1:1 ratio of datastores to IP addresses.
vmkernel.log just shows that the connection was lost to the NFS
server. It recovers w/in 10 seconds. We have 11 nodes in this particular ESX cluster.
Not all 11 ESXi nodes lose connectivity to the datastore at the same
time. I've seen it affect just one ESXi node's connectivity to a single datastore. I've also seen it affect more than one ESXi node and multiple datastores on the same filer.
Until recently, it was only observed during storage vmotions. We
recently discovered it happening during vmotion activity managed by DRS after a node was brought out of maintenance mode. As I said before, it is generally a rare occurrence so it is difficult to trigger on our own.
Thanks in advance for any insight/experiences.
Phil
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters