Thanks for the response!
Yes. We are running 10g. I know flow control is enabled on the 10g adapters on the NetApps. Not sure if it is enabled on the switches. I'll have to check with our networking team. Do you know if a pause frame would show up somewhere in the port statistics? The switches are Nexus 5K's.
We have been examining TCP Window Sizes during packet traces but have not found anything interesting. Of course, whenever we run a packet capture the problem never occurs so TCP Window Sizes could still be an issue.
On Tue, Mar 18, 2014 at 12:04 AM, Wilkinson, Brent < Brent.Wilkinson@netapp.com> wrote:
Are you running 10g? If so what are the flow control settings end to end?
Sent from mobile device.
On Mar 17, 2014, at 10:55 PM, "Philbert Rupkins" < philbertrupkins@gmail.com> wrote:
I'll also mention that I received a response from a gentleman at NetApp who pointed out the following KB article, which recommends reducing the NFS queue depth.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2016122
We noticed this KB article but have yet to try it. We are considering other options at the moment because the article says this issue is fixed in the version of ONTAP (8.1.2P4) we are running. However, if nothing else pans out, we will give it a shot.
Another note - this is also a highly shared environment: we serve FCP, CIFS and NFS clients from the same filers (and vfilers) that serve the NFS datastores. We have yet to find evidence that high utilization from the other clients on the same array contributes to the problem, but it is on the radar.
Also worth noting, we are running VSC 4.2.1. It reports all of the ESX hosts to be in compliance with the recommended settings.
On Mon, Mar 17, 2014 at 8:30 PM, Philbert Rupkins < philbertrupkins@gmail.com> wrote:
Hello Toasters,
Anybody have any issues with seemingly random ESXi 5.5 NFS datastore disconnects during heavy load?
Our Environment:
ESXi 5.5
F3240
ONTAP 8.1.2P4
It doesn't happen all the time. Only during heavy load but even then there is no guarantee that it will happen. We have yet to find a consistent trigger.
Datastores are mounted via shortname. We are planning to mount via IP address to rule out any name resolution issues but that will take some time. DNS is generally solid so we are doubtful DNS has anything to do with it but we should align ourselves with best practices.
We serve all of our NFS through vfilers. Some of our vfilers host 5 NFS datastores from a single IP address. I mention this because I have come across documentation recommending a 1:1 ratio of datastores to IP addresses.
vmkernel.log just shows that the connection was lost to the NFS server. It recovers w/in 10 seconds. We have 11 nodes in this particular ESX cluster.
Not all 11 ESXi nodes lose connectivity to the datastore at the same time. I've seen it affect just one ESXi node's connectivity to a single datastore. I've also seen it affect more than one ESXi node and multiple datastores on the same filer.
Until recently, it was only observed during storage vmotions. We recently discovered it happening during vmotion activity managed by DRS after a node was brought out of maintenance mode. As I said before, it is generally a rare occurrence so it is difficult to trigger on our own.
Thanks in advance for any insight/experiences.
Phil
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
It would be a fantastic idea to turn off all flow control in both directions and let the TCP congestion protocol handle it. That could very well be the issue.
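(To answer the pause-frame question: NX-OS does keep per-port pause counters, so your networking team should be able to check. Roughly - and with the interface number as a placeholder - something like:

```text
nexus5k# show interface flowcontrol      ! admin/oper flow-control state plus RxPause/TxPause counters per port
nexus5k# show interface ethernet 1/1     ! detailed per-port counters, including Rx/Tx pause frames
```

Non-zero and growing pause counters on the filer-facing ports would confirm flow control is actually kicking in during the busy periods.)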
--tmac
*Tim McCarthy*
*Principal Consultant*
NCDA ID: XK7R3GEKC1QQ2LVD (Clustered ONTAP) - Expires: 08 November 2014
RHCE6: 110-107-141 - Current until Aug 02, 2016 (verify: https://www.redhat.com/wapps/training/certification/verify.html?certNumber=110-107-141&isSearch=False&verify=Verify)
NCSIE ID: C14QPHE21FR4YWD4 (Clustered ONTAP) - Expires: 08 November 2014
I'll second that!
To quote tr-4068:
6.6 Flow Control Overview
"Modern network equipment and protocols generally handle port congestion better than in the past. While NetApp had previously recommended flow control “send” on ESX hosts and NetApp storage controllers, the *current recommendation, especially with 10GbE equipment, is to disable flow control on ESXi, NetApp FAS, and the switches in between.*
With ESXi 5, flow control is not exposed in the vSphere client GUI. The ethtool command sets flow control on a per-interface basis. There are three options for flow control: autoneg, tx, and rx. tx is equivalent to “send” on other devices.
Note: With some NIC drivers, including some Intel® drivers, autoneg must be disabled in the same command line for tx and rx to take effect.
~ # ethtool -A vmnic2 autoneg off rx off tx off
~ # ethtool -a vmnic2
Pause parameters for vmnic2:
Autonegotiate: off
RX: off
TX: off"
And the symptoms fit well: traffic gets "paused" in a congested scenario - maybe just from one side - and an 'unpause' frame never arrives, thereby disconnecting the datastore for good.
HTH
Sebastian
This is strange, because other documents state that flow control should not be enabled:
TR-3802: Ethernet Storage Best Practices
"For these reasons, it's not recommended to enable flow control throughout the network (including switches, data ports, intracluster ports). ... FLOW CONTROL RECOMMENDATIONS: Ensure flow control is disabled on both the storage controller and the switch it is connected to."
Also, in several support cases we were told to disable flow control.
Kind regards, Wouter Vervloesem
Neoria - Uptime Group Veldkant 35D B-2550 Kontich
Tel: +32 (0)3 451 23 82 Mailto: wouter.vervloesem@neoria.be Web: http://www.neoria.be
Sorry, I misread the previous mail. It is indeed best to disable flow control; I just provided a second source saying the same.
Again, my apologies for not reading the mail correctly.
Kind regards, Wouter Vervloesem
That is what we both said: disable flow control.
It was *previously* recommended to use flow control. Not any more, especially on 10G networks. Disable flow control - both directions, everywhere.
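For reference, a sketch of what that looks like on a 7-mode filer and on the ESXi side (interface names e1a/vmnic2 are placeholders; the rc-file step is needed for the filer setting to survive a reboot):

```text
filer> ifconfig e1a flowcontrol none    # turn off send and receive pause on this 10g port
filer> ifconfig e1a                     # verify: the output should show flowcontrol none
# persist by adding the flowcontrol option to the matching ifconfig line in /etc/rc
~ # ethtool -A vmnic2 autoneg off rx off tx off   # ESXi side, per the TR-4068 excerpt above
```

And of course the Nexus ports in between need the same treatment, or one side will still be emitting pause frames into a link that ignores them.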
--tmac
On Tue, Mar 18, 2014 at 7:56 AM, Vervloesem Wouter < wouter.vervloesem@neoria.be> wrote:
This is strange, because other documents state that flow control should not be enabled :
TR-3802 : Ethernet Storage Best Practices "For these reasons, it's not recommended to enable flow control throughout the network (including switches, data ports, intracluster ports). ... FLOW CONTROL RECOMMENDATIONS Ensure flow control is disabled on both the storage controller and the switch it is connected to."
Also, in several support cases we were told to disable flow control.
Mvg, Wouter Vervloesem
Neoria - Uptime Group Veldkant 35D B-2550 Kontich
Tel: +32 (0)3 451 23 82 Mailto: wouter.vervloesem@neoria.be Web: http://www.neoria.be
Op 18-mrt.-2014, om 12:41 heeft Sebastian Goetze spgoetze@gmail.com het volgende geschreven:
I'll second that!
To quote tr-4068:
6.6 Flow Control Overview Modern network equipment and protocols generally handle port congestion
better than in the past. While
NetApp had previously recommended flow control "send" on ESX hosts and
NetApp storage controllers,
the current recommendation, especially with 10GbE equipment, is to
disable flow control on ESXi,
NetApp FAS, and the switches in between. With ESXi 5, flow control is not exposed in the vSphere client GUI. The
ethtool command sets flow control
on a per-interface basis. There are three options for flow control:
autoneg, tx, and rx. tx is equivalent to
"send" on other devices. Note: With some NIC drivers, including some Intel (R) drivers, autoneg must be disabled in the same command line for tx and rx to take effect. ~ # ethtool -A vmnic2 autoneg off rx off tx off ~ # ethtool -a vmnic2 Pause parameters for vmnic2: Autonegotiate: off RX: off TX: off
And the symptoms fit well: disconnecting ("pausing") traffic in a
congested scenario - maybe just from the one side - and never receiving a 'unpause' frame, thereby disconnecting the datastore for good.
HTH
Sebastian
On 3/18/2014 11:26 AM, tmac wrote:
It would be a fantastic idea to turn off all flow control in bot
directions. Let the TCP congestion protocol handle it.
That very well could be the issue.
--tmac
Tim McCarthy Principal Consultant
Clustered ONTAP
Clustered ONTAP
NCDA ID: XK7R3GEKC1QQ2LVD RHCE6 110-107-141 NCSIE
ID: C14QPHE21FR4YWD4
Expires: 08 November 2014 Current until Aug 02, 2016
Expires: 08 November 2014
On Tue, Mar 18, 2014 at 1:24 AM, Philbert Rupkins <
philbertrupkins@gmail.com> wrote:
Thanks for the response!
Yes. We are running 10g. I know flow control is enabled on the 10g
adapters on the NetApps. Not sure if it is enabled on the switches. I'll have to check with our networking team. Do you know if a pause frame would show up somewhere in the port statistics? The switches are Nexus 5K's.
We have been examining TCP Window Sizes during packet traces but have
not found anything interesting. Of course, whenever we run a packet capture the problem never occurs so TCP Window Sizes could still be an issue.
On Tue, Mar 18, 2014 at 12:04 AM, Wilkinson, Brent <
Brent.Wilkinson@netapp.com> wrote:
Are you running 10g? If so what are the flow control settings end to
end?
Sent from mobile device.
On Mar 17, 2014, at 10:55 PM, "Philbert Rupkins" <
philbertrupkins@gmail.com> wrote:
I'll also mention that I received a response from a gentleman at
NetApp who pointed out the following KB article the recommends reducing the NFS Queue depth.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd...
We noticed this KB article but have yet to try it. We are
considering other options at the moment because the article says this issue is fixed in the version of ONTAP (8.1.2P4) we are running. However, if nothing else pans out, we will give it a shot.
Another note - this is also a highly shared environment in which we
service FCP, CIFS and NFS clients from the same filers (and vfilers) we service the NFS datastores from. We have yet to show evidence of high utilization from the other clients on the same array contributing to the problem but it is on the radar.
Also worth noting, we are running VSC 4.2.1. It reports all of the
ESX hosts to be in compliance with the recommended settings.
On Mon, Mar 17, 2014 at 8:30 PM, Philbert Rupkins <
philbertrupkins@gmail.com> wrote:
Hello Toasters,
Anybody have any issues with seemingly random ESXi 5.5 NFS datastore
disconnects during heavy load?
Our Environment:
ESXi 5.5 F3240 ONTAP 8.1.2P4
It doesn't happen all the time. Only during heavy load but even then
there is no guarantee that it will happen. We have yet to find a consistent trigger.
Datastores are mounted via shortname. We are planning to mount via IP
address to rule out any name resolution issues but that will take some time. DNS is generally solid so we are doubtful DNS has anything to do with it but we should align ourselves with best practices.
We serve all of our NFS through vfilers. Some of our vfilers host
5 NFS datastores from a single IP address. I mention this because I have come across documentation recommending a 1:1 ratio of datastores to IP addresses.
vmkernel.log just shows that the connection was lost to the NFS
server. It recovers w/in 10 seconds. We have 11 nodes in this particular ESX cluster.
Not all 11 ESXi nodes lose connectivity to the datastore at the same
time. I've seen it affect just one ESXi node's connectivity to a single datastore. I've also seen it affect more than one ESXi node and multiple datastores on the same filer.
Until recently, it was only observed during storage vmotions. We
recently discovered it happening during vmotion activity managed by DRS after a node was brought out of maintenance mode. As I said before, it is generally a rare occurrence so it is difficult to trigger on our own.
Thanks in advance for any insight/experiences.
Phil
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
Thanks for the responses everybody. We'll certainly be looking into the flow control settings in a bit more detail.
The disconnected datastore seems to reconnect on its own consistently. It doesn't happen often, but we typically see the following:
1. One or more all paths down warnings for a particular NFS datastore on an ESX host. Sometimes we see it for more than one datastore on multiple ESX hosts.
2. An informational message stating the datastore has reconnected.
On rare occasions we see the following:
1. One or more all paths down warnings for the NFS datastore.
2. An error stating the datastore has been disconnected.
3. An informational alert stating the datastore has reconnected.
I just wanted to point this out as one of the responses mentioned a permanent disconnect in the context of flow control.
Thank you for pointing out TR-4068. We have yet to make the leap to cluster mode but I'll be referencing this heavily when we do. I believe similar flow control settings are recommended in TR-3749.
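For reference on the 7-Mode side, the per-interface flow control setting can be inspected and changed from the filer CLI. A sketch, assuming a 10g port named e1a (check your own interface names; the flowcontrol keyword is the standard 7-Mode ifconfig option):

```shell
# Show current interface settings, including flow control:
ifconfig e1a

# Disable flow control on the 10g port:
ifconfig e1a flowcontrol none

# Mirror the change in /etc/rc so it survives a reboot.
```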
On Tue, Mar 18, 2014 at 7:01 AM, tmac <tmacmd@gmail.com> wrote:
That is what we both said....disable flow control.
It was *previously* recommended to use flow control. Not any more, especially on 10G networks. Disable flow control. Both directions. Everywhere.
--tmac
*Tim McCarthy*
*Principal Consultant*
Clustered ONTAP
NCDA ID: XK7R3GEKC1QQ2LVD (expires 08 November 2014)
RHCE6 110-107-141 (current until 02 August 2016) https://www.redhat.com/wapps/training/certification/verify.html?certNumber=110-107-141&isSearch=False&verify=Verify
NCSIE ID: C14QPHE21FR4YWD4 (expires 08 November 2014)
On Tue, Mar 18, 2014 at 7:56 AM, Vervloesem Wouter < wouter.vervloesem@neoria.be> wrote:
This is strange, because other documents state that flow control should not be enabled:
TR-3802, Ethernet Storage Best Practices: "For these reasons, it's not recommended to enable flow control throughout the network (including switches, data ports, intracluster ports). ... FLOW CONTROL RECOMMENDATIONS: Ensure flow control is disabled on both the storage controller and the switch it is connected to."
Also, in several support cases we were told to disable flow control.
Mvg, Wouter Vervloesem
Neoria - Uptime Group Veldkant 35D B-2550 Kontich
Tel: +32 (0)3 451 23 82 Mailto: wouter.vervloesem@neoria.be Web: http://www.neoria.be
Op 18-mrt.-2014, om 12:41 heeft Sebastian Goetze spgoetze@gmail.com het volgende geschreven:
I'll second that!
To quote TR-4068:

"6.6 Flow Control Overview
Modern network equipment and protocols generally handle port congestion better than in the past. While NetApp had previously recommended flow control "send" on ESX hosts and NetApp storage controllers, the current recommendation, especially with 10GbE equipment, is to disable flow control on ESXi, NetApp FAS, and the switches in between. With ESXi 5, flow control is not exposed in the vSphere client GUI. The ethtool command sets flow control on a per-interface basis. There are three options for flow control: autoneg, tx, and rx. tx is equivalent to "send" on other devices.
Note: With some NIC drivers, including some Intel(R) drivers, autoneg must be disabled in the same command line for tx and rx to take effect.

~ # ethtool -A vmnic2 autoneg off rx off tx off
~ # ethtool -a vmnic2
Pause parameters for vmnic2:
Autonegotiate: off
RX: off
TX: off"
And the symptoms fit well: traffic gets "paused" in a congested scenario - maybe just from one side - and an "unpause" frame is never received, thereby disconnecting the datastore for good.
HTH
Sebastian
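On the switch side, a Nexus 5K does expose pause frames in its port statistics, so the congestion theory above is checkable. A sketch, assuming the filer-facing port is eth1/1 (substitute your own interface):

```shell
# NX-OS: show negotiated flow control state and pause-frame counters per port.
show interface ethernet 1/1 flowcontrol

# The output includes RxPause/TxPause counters; non-zero, incrementing
# pause counters during the disconnect windows would point at congestion.
```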
On 3/18/2014 11:26 AM, tmac wrote:
It would be a fantastic idea to turn off all flow control in both directions. Let the TCP congestion protocol handle it.
That very well could be the issue.
--tmac
How loaded is the system? You're not getting latency spikes, are you?
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Philbert Rupkins
Sent: Wednesday, 19 March 2014 1:45 a.m.
To: tmac
Cc: Wilkinson, Brent; toasters@teaparty.net
Subject: Re: NFS Datastore Disconnect
Hi Brent,
The system doesn't seem to be very loaded at the time the issue occurs. No obvious disk or CPU bottlenecks. We are keeping an eye on utilization as we try to reproduce the issue ourselves.
No obvious latency spikes reported by the array. The client (an ESXi host in this case) does show an increase in latency, which I suspect is the result of the all-paths-down condition that leads to the disconnect.
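For the filer-side picture while trying to reproduce, 7-Mode sysstat gives a per-interval view of CPU, protocol ops, and disk utilization. A sketch (run on the physical filer, not the vfiler):

```shell
# One line per second: CPU, NFS/CIFS ops, net and disk throughput,
# consistency-point type, and disk utilization.
sysstat -x 1

# Watch for back-to-back CPs (B/b in the "CP ty" column) or sustained
# high disk util coinciding with the disconnect windows.
```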
-Phil
On Tue, Mar 18, 2014 at 3:23 PM, Bradley, Shane <shane.bradley@nz.fujitsu.com> wrote:
How loaded is the system? You're not getting latency spikes are you?