It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.
I'd open a support case on this one. In addition, if you want to trigger an AutoSupport and send me the serial numbers directly, I can take a glance at a few stats to see if anything looks odd.

From: Fenn, Michael [mailto:fennm@DEShawResearch.com]
Sent: Tuesday, January 23, 2018 6:23 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net
Subject: Re: NFS issue after upgrading filers to 9.2P2
The messages are not necessarily indicative of a network problem.
The kernel prints "nfs: server … not responding, still trying" when an operation has timed out (timeo, in deciseconds) for the configured number of retries (retrans). Once the server responds, it prints "nfs: server … OK".
Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
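As a rough illustration of how those two knobs interact (a sketch only: the timeo and retrans values below are assumptions, and UDP's adaptive retransmit and backoff are ignored):

```shell
# Lower bound on when the kernel first logs "not responding, still trying":
# the request must time out (timeo, in tenths of a second) once for the
# initial attempt plus once per retransmission (retrans). Real UDP mounts
# use adaptive timeouts and backoff, so treat this as a floor, not exact.
timeo=600        # deciseconds, i.e. 60 s per attempt (assumed value)
retrans=2        # assumed; check the actual per-mount values with: nfsstat -m
floor_secs=$(( timeo * (retrans + 1) / 10 ))
echo "${floor_secs}s before the first 'not responding' message"
```

With the timeo=600 from the tuned mount later in this thread, that works out to minutes, which matches the gaps between syslog entries.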
Thanks, Michael
From: <toasters-bounces@teaparty.net> on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.com>
Date: Tuesday, January 23, 2018 at 10:38 AM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>, <toasters@teaparty.net>
Subject: RE: NFS issue after upgrading filers to 9.2P2
Those messages are indicative of a network problem. The packets are going through, then they succeed when the NFS client retries, then they pause again.
I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of a locking, firewall, or general configuration problem you should have no access whatsoever.
I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.
What are the mount options used, and are you using DNFS?
From: <toasters-bounces@teaparty.net> On Behalf Of Mark Saunders
Sent: Tuesday, January 23, 2018 4:29 PM
To: toasters@teaparty.net
Subject: NFS issue after upgrading filers to 9.2P2
Hi gents, today we upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the SAP database servers' NFS mounts. When a server is bounced the mounts attach with no problems, but after a few minutes a df -h becomes very slow reporting the NFS-mounted directories, and if the databases are started they hang, at which point df -h also hangs. This sometimes recovers enough for df -h to work again, but the databases are a lost cause right now.
In the server messages we get lots of entries for the SVM:
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:01:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:07 jwukccsbci last message repeated 5 times
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
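For what it's worth, the stall lengths can be pulled straight out of a syslog excerpt like that; a small sketch (the log path is illustrative, and the sample lines are copied from the excerpt above):

```shell
# Pair each "not responding" line with the next "OK" line and print how
# long each stall lasted.
cat > /tmp/nfs_stalls.log <<'EOF'
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
EOF
stalls=$(awk '
  function secs(t, a) { split(t, a, ":"); return a[1]*3600 + a[2]*60 + a[3] }
  /not responding/ { start = secs($3) }
  /OK$/ && start   { print secs($3) - start "s stall"; start = 0 }
' /tmp/nfs_stalls.log)
echo "$stalls"
```

On the excerpt above this reports 40s and 21s stalls, which is long enough to hang a database's I/O but short enough to let df limp along in between.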
Is there anything that would have changed in the upgrade to lock down NFS, or changed options that we might need to change back?
The Red Hat servers run an old kernel, 2.6.18-371.el5, which has some bugs, but this was working fine before the filer upgrade was carried out.
Regards
Mark
Data Centre Sysadmin Team, Managed Services
Phone: 02476 694455 Ext 2567
The Sysadmin Team promoting PCMS Values ~Integrity~Respect~Commitment~ ~Continuous Improvement~
The information contained in this e-mail is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. If you are not the intended recipient of this e-mail, the use of this information or any disclosure, copying or distribution is prohibited and may be unlawful. If you received this in error, please contact the sender and delete the material from any computer. The views expressed in this e-mail may not necessarily be the views of the PCMS Group plc and should not be taken as authority to carry out any instruction contained. The PCMS Group reserves the right to monitor and examine the content of all e-mails.
The PCMS Group plc is a company registered in England and Wales with company number 1459419 whose registered office is at PCMS House, Torwood Close, Westwood Business Park, Coventry CV4 8HX, United Kingdom. VAT No: GB 705338743
Thanks for the quick replies, and sorry for the delay in responding, but I had been working on this since 5am so had to go sleep.
I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when we were on 9.1 either).
A third party we have contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so that may be another false trail.
The mount options were kept fairly straightforward:
nfs nolock,_netdev,udp 0 0
and we have also tried the same options as one of the production servers, which had tuned options; that server is on another cluster so isn't affected by this yet:
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
How would I be able to tell if we are using DNFS?
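On the DNFS question: Direct NFS is Oracle's own NFS client inside the database, so the kernel mount options above don't govern it. If it is enabled, the database alert log records it at instance startup; a sketch of the check, with an assumed log path and a sample line for illustration:

```shell
# Hypothetical alert log path; real ones live under the Oracle diag
# directory, e.g. $ORACLE_BASE/diag/rdbms/<db>/<sid>/trace/alert_<SID>.log.
alert_log=/tmp/alert_SID.log
# Sample of the line a DNFS-enabled instance writes at startup (illustrative).
cat > "$alert_log" <<'EOF'
Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0
EOF
if grep -q "Direct NFS" "$alert_log"; then
  echo "DNFS in use"
else
  echo "kernel NFS only"
fi
```

A running instance can also be checked from SQL: if the v$dnfs_servers view returns rows, Direct NFS is active.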
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
I should have asked - is this SAP HANA or something like SAP on an Oracle database?
Also, what do they mean, "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 or NFSv4 client that obeys the specification. There are a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.
The thing about fastpath does ring a few bells.
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.
https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-...
The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
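A client-side capture for that could look like the following (a command sketch only; the interface name and SVM address are placeholders, and it needs root while the stall is happening):

```shell
# Capture full NFS packets between this client and the SVM's data LIF so
# the trace can be compared with one taken at the switch/filer end.
# eth0 and 10.0.0.50 are placeholders for the real interface and LIF address.
tcpdump -i eth0 -s 0 -w /tmp/nfs_client.pcap host 10.0.0.50 and port 2049
```

Matching captures from the other endpoints then show whether it is the request or the reply that goes missing.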
This community post also does a good job explaining it:
https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgr...
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin Sent: Tuesday, January 23, 2018 5:28 PM To: Steiner, Jeffrey Jeffrey.Steiner@netapp.com; Mark Saunders Mark.Saunders@pcmsgroup.com; Fenn, Michael fennm@DEShawResearch.com; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.
https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-...
The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
From: toasters-bounces@teaparty.netmailto:toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey Sent: Tuesday, January 23, 2018 5:24 PM To: Mark Saunders <Mark.Saunders@pcmsgroup.commailto:Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.commailto:fennm@DEShawResearch.com>; toasters@teaparty.netmailto:toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
I should have asked - is this SAP HANA or something like SAP on an Oracle database?
Also, what do they mean "it's not on the IMT?" Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.
The thing about fastpath does ring a few bells.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Tuesday, January 23, 2018 11:18 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.commailto:Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.commailto:fennm@DEShawResearch.com>; toasters@teaparty.netmailto:toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Thanks for the quick replies sorry for the delay in e responding but I was working on this since 5am so had to go sleep.
I have a call open with netapp but have had the coockie cutter response of it isn’t on the Interoperability Matrix Tool as a supported version (It wasn’t when on 9.1 anyway)
A third party we have contact with have sent me a link to details about fastpathing being removed but I don’t think we were using it so maybe another false line to look down.
The mount options were kept fairly straight forward as
nfs nolock,_netdev,udp 0 0
and we have also tried the same as the one of the production servers which had tuned options, this is on another cluster so isn’t affected by this yet.
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
How would I be able to tell if we are using DNFS ?
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com] Sent: 23 January 2018 17:29 To: Fenn, Michael; Mark Saunders; toasters@teaparty.netmailto:toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.
I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.
From: Fenn, Michael [mailto:fennm@DEShawResearch.com] Sent: Tuesday, January 23, 2018 6:23 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.commailto:Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.commailto:Mark.Saunders@pcmsgroup.com>; toasters@teaparty.netmailto:toasters@teaparty.net Subject: Re: NFS issue after upgrading filers to 9.2P2
The messages are not necessarily indicative of a network problem.
The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs: server … OK".
Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
Thanks, Michael
From: <toasters-bounces@teaparty.netmailto:toasters-bounces@teaparty.net> on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.commailto:Jeffrey.Steiner@netapp.com> Date: Tuesday, January 23, 2018 at 10:38 AM To: Mark Saunders <Mark.Saunders@pcmsgroup.commailto:Mark.Saunders@pcmsgroup.com>, "toasters@teaparty.netmailto:toasters@teaparty.net" <toasters@teaparty.netmailto:toasters@teaparty.net> Subject: RE: NFS issue after upgrading filers to 9.2P2
Those messages are indicative of a network problem. The packets are going through, then they succeed when the NFS client retries, then they pause again.
I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of a locking, firewall, or general configuration problem you should have no access whatsoever.
I've seen some weird NFS bug sin SUSE, but that RHEL version should be fine.
What are the mount options used, and are you using DNFS?
From: toasters-bounces@teaparty.netmailto:toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Mark Saunders Sent: Tuesday, January 23, 2018 4:29 PM To: toasters@teaparty.netmailto:toasters@teaparty.net Subject: NFS issue after upgrading filers to 9.2P2
Hi gents today we have upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working we just have a strange issue with SAP database servers NFS mounts. When the server is bounced the mounts are attached with no problems but after a few minutes a df –h starts to be very slow reporting the NFS mounted directories and if the databases are started up they hang and a df –h then also hangs. This sometimes recovers enough to then allow a df –h to work again but the databases are a lost cause right now.
In the server messages we get lots of entries for the SVM
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying Jan 23 07:01:47 jwukccsbci last message repeated 5 times Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK Jan 23 07:02:07 jwukccsbci last message repeated 5 times Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying Jan 23 07:02:47 jwukccsbci last message repeated 5 times Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Is there anything that would of changed in the upgrade to lock down NFS or changes options that we might need to change back.
The Red Hat servers are on an old kernel, version 2.6.18-371.el5, which has some bugs, but this was working fine before the filer upgrade was carried out.
Regards,
Mark
Data Centre Sysadmin Team, Managed Services
Phone: 02476 694455 Ext 2567
The Sysadmin Team promoting PCMS Values: ~Integrity~Respect~Commitment~Continuous Improvement~
The information contained in this e-mail is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. If you are not the intended recipient of this e-mail, the use of this information or any disclosure, copying or distribution is prohibited and may be unlawful. If you received this in error, please contact the sender and delete the material from any computer. The views expressed in this e-mail may not necessarily be the views of the PCMS Group plc and should not be taken as authority to carry out any instruction contained. The PCMS Group reserves the right to monitor and examine the content of all e-mails.
The PCMS Group plc is a company registered in England and Wales with company number 1459419 whose registered office is at PCMS House, Torwood Close, Westwood Business Park, Coventry CV4 8HX, United Kingdom. VAT No: GB 705338743
In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?
“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”
From: Parisi, Justin Sent: Tuesday, January 23, 2018 5:30 PM To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
This community post also does a good job explaining it:
https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgr...
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin Sent: Tuesday, January 23, 2018 5:28 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.
https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-...
The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey Sent: Tuesday, January 23, 2018 5:24 PM To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
I should have asked - is this SAP HANA or something like SAP on an Oracle database?
Also, what do they mean "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.
The thing about fastpath does ring a few bells.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Tuesday, January 23, 2018 11:18 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Thanks for the quick replies, and sorry for the delay in responding, but I had been working on this since 5am so had to get some sleep.
I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when on 9.1 either).
A third party we are in contact with sent me a link to details about fastpath being removed, but I don't think we were using it, so that may be another false lead.
The mount options were kept fairly straightforward:
nfs nolock,_netdev,udp 0 0
and we have also tried the same options as one of the production servers, which had tuned options; that server is on another cluster, so it isn't affected by this yet.
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
How would I be able to tell if we are using DNFS?
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
I faintly remember a customer or two who had issues with their network that were somewhat remediated by fastpath, and when fastpath went away they got bit by the weirdness in their network config.
Also having udp in the mount options doesn't make sense.
Justin - I thought UDP was totally desupported in cDOT, and it's probably risky to use anyway.
When you finish reading your 250 emails on the subject after you wake up, let us know whether this is SAP HANA or SAP on Oracle.
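[Editor's note: for reference, the tuned fstab options quoted earlier in the thread, rewritten for TCP, might look like the line below. This is a sketch, not a NetApp-verified recommendation; "hard" is added here because hard mounts are generally advised for database workloads, and proto=tcp replaces udp.]

```
nfs nfsvers=3,proto=tcp,nolock,_netdev,rw,hard,rsize=32768,wsize=32768,timeo=600 0 0
```

With proto=tcp, timeo=600 (60 seconds before a major timeout) is the common default on Linux clients, so the explicit setting mainly documents intent.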
This is SAP on Oracle.
I have found that on our production servers there is a Red Hat kernel bug, so a network restart has been put into the boot sequence; we are going to replicate that on one of the servers that is having issues.
We were using UDP in the mount options as it was giving better performance than TCP; we have put in changes to test today switching it back to TCP.
Regards
Mark
What's the bug number?
I can't find an ASUP in the system, but if the problem persists you can run "node run local netstat -sp tcp" and send output. That might indicate whether flow control is happening.
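[Editor's note: when reading that counter dump, the interesting signals are usually the retransmit and timeout counters. A minimal sketch of filtering them out of BSD-style "netstat -sp tcp" output follows; the sample text and exact counter wording below are illustrative assumptions, as the output format varies by platform and release.]

```python
import re

def retransmit_counters(netstat_output: str) -> dict:
    """Pull lines mentioning retransmit/timeout counters out of
    'netstat -sp tcp' output (BSD-style; exact wording varies)."""
    counters = {}
    for line in netstat_output.splitlines():
        # BSD netstat prints "<count> <description>" per statistic line
        m = re.match(r"\s*(\d+)\s+(.*(retransmit|timed out|timeout).*)", line,
                     re.IGNORECASE)
        if m:
            counters[m.group(2).strip()] = int(m.group(1))
    return counters

# Illustrative sample, not output from the system in this thread
sample = """\
tcp:
        129637 packets sent
                1523 data packets (90210 bytes) retransmitted
        98 connections timed out
        12 keepalive timeouts
"""
print(retransmit_counters(sample))
```

A rising retransmitted-packets counter relative to packets sent would point toward drops or flow-control pressure somewhere on the path, which fits the "not responding / OK" pattern the client logs.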
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Wednesday, January 24, 2018 12:48 PM To: Steiner, Jeffrey Jeffrey.Steiner@netapp.com; Parisi, Justin Justin.Parisi@netapp.com; Fenn, Michael fennm@DEShawResearch.com; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
The is SAP on Oracle
I have found that on our production servers there is a redhat kernel bug so a network restart has been put into the boot sequence we are going to replicate that on one of the servers that is having issues.
We were using udp for the mount options as it was giving better performance than tcp we have put in the things to test today changing it back to tcp.
Regards
Mark
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com] Sent: 23 January 2018 22:37 To: Parisi, Justin; Mark Saunders; Fenn, Michael; toasters@teaparty.netmailto:toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
I faintly remember a customer or two who had issues with their network that were somewhat remediated by fastpath, and when fastpath went away they got bit by the weirdness in their network config.
Also having udp in the mount options doesn't make sense.
Justin - I thought UDP was totally desupported in cDOT, and it's probably risky to use anyway.
When you finish reading your 250 emails on the subject after you wake up, let us know whether this is SAP HANA or SAP on Oracle.
From: Parisi, Justin Sent: Tuesday, January 23, 2018 11:33 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?
“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”
From: Parisi, Justin Sent: Tuesday, January 23, 2018 5:30 PM To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
This community post also does a good job explaining it:
https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgr...
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin Sent: Tuesday, January 23, 2018 5:28 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.
https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-...
The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey Sent: Tuesday, January 23, 2018 5:24 PM To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
I should have asked - is this SAP HANA or something like SAP on an Oracle database?
Also, what do they mean "it's not on the IMT?" Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.
The thing about fastpath does ring a few bells.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Tuesday, January 23, 2018 11:18 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Thanks for the quick replies, and sorry for the delay in responding, but I had been working on this since 5am so had to go sleep.
I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when we were on 9.1 either).
A third party we have contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so that may be another false trail.
The mount options were kept fairly straightforward:
nfs nolock,_netdev,udp 0 0
and we have also tried the same options as one of the production servers, which had tuned options; that server is on another cluster so isn't affected by this yet.
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
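For comparison, a sketch of what that tuned entry would look like switched back to TCP (the server, export path and mount point here are placeholders, not taken from this thread):

```shell
# Hypothetical fstab line: the tuned options above with udp swapped for tcp.
# "sapsvm:/vol/sapdata" and "/sapdata" are stand-in names.
entry='sapsvm:/vol/sapdata /sapdata nfs nfsvers=3,nolock,_netdev,rw,tcp,rsize=32768,wsize=32768,timeo=600 0 0'
echo "$entry"
```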
How would I be able to tell if we are using DNFS?
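(For what it's worth, Oracle's Direct NFS client announces itself in the database alert log at instance startup, so grepping the alert log is one way to tell. A sketch, with a temp file standing in for the real alert log, whose path varies by install:)

```shell
# dNFS logs a line like the one below when the instance starts with the
# Direct NFS ODM library linked in. A real check would grep the actual
# alert log for the instance; a sample file stands in here.
log=$(mktemp)
echo 'Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0' > "$log"
grep -c 'Direct NFS' "$log"    # non-zero count means dNFS is active
rm -f "$log"
```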
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com] Sent: 23 January 2018 17:29 To: Fenn, Michael; Mark Saunders; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.
I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.
From: Fenn, Michael [mailto:fennm@DEShawResearch.com] Sent: Tuesday, January 23, 2018 6:23 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net Subject: Re: NFS issue after upgrading filers to 9.2P2
The messages are not necessarily indicative of a network problem.
The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs: server … OK".
Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
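A rough sketch of that arithmetic, assuming timeo=600 and retrans=2 and ignoring the backoff doubling that UDP retries apply (so real intervals can be longer):

```shell
# Illustration only: time until the "not responding" message with
# timeo=600 (deciseconds) and retrans=2, ignoring retry backoff.
timeo=600
retrans=2
per_try=$((timeo / 10))              # seconds per attempt
major=$((per_try * (retrans + 1)))   # initial try + retrans retries
echo "each attempt waits ${per_try}s; major timeout after ~${major}s"
```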
Thanks, Michael
From: <toasters-bounces@teaparty.net> on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.com> Date: Tuesday, January 23, 2018 at 10:38 AM To: Mark Saunders <Mark.Saunders@pcmsgroup.com>, "toasters@teaparty.net" <toasters@teaparty.net> Subject: RE: NFS issue after upgrading filers to 9.2P2
Those messages are indicative of a network problem. The packets are going through, then they succeed when the NFS client retries, then they pause again.
I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of a locking, firewall, or general configuration problem you should have no access whatsoever.
I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.
What are the mount options used, and are you using DNFS?
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Mark Saunders Sent: Tuesday, January 23, 2018 4:29 PM To: toasters@teaparty.net Subject: NFS issue after upgrading filers to 9.2P2
Hi gents, today we upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the SAP database servers' NFS mounts. When a server is bounced the mounts attach with no problems, but after a few minutes a df -h becomes very slow reporting the NFS-mounted directories, and if the databases are started up they hang and a df -h then also hangs. This sometimes recovers enough to allow a df -h to work again, but the databases are a lost cause right now.
In the server messages we get lots of entries for the SVM:
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:01:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:07 jwukccsbci last message repeated 5 times
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Is there anything that would have changed in the upgrade to lock down NFS, or changed options that we might need to change back?
The Red Hat servers are on an old kernel version, 2.6.18-371.el5, which has some bugs, but this was working fine before the filer upgrade was carried out.
Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.
Complete details are in TR-3633, but these are the two that you want to watch:
[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot
sunrpc.tcp_max_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128
Newer versions of Linux will allow a ridiculous number of unacknowledged RPC operations to build up. The result can be sending ONTAP into a flow-control mode until the OS catches up. We see problems mostly with slow clients. For example, if you're trying to read a lot of data on a host with 1Gb connectivity from a high-end ONTAP system, the OS can ask for data quicker than it can process the responses.
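A sketch of putting that cap in place on the client, per TR-3633 (a temp file stands in for /etc/modprobe.d/sunrpc.conf here; on a live box you would write the real file and apply the values with sysctl -w):

```shell
# Persist the RPC slot-table cap via sunrpc module options.
conf=$(mktemp)    # stand-in for /etc/modprobe.d/sunrpc.conf
cat > "$conf" <<'EOF'
options sunrpc tcp_slot_table_entries=128
options sunrpc tcp_max_slot_table_entries=128
EOF
grep -c '=128' "$conf"    # prints 2: both caps are set
rm -f "$conf"
```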
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Wednesday, January 24, 2018 12:26 PM To: Parisi, Justin Justin.Parisi@netapp.com; Steiner, Jeffrey Jeffrey.Steiner@netapp.com; Fenn, Michael fennm@DEShawResearch.com; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Justin
I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.
I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run "statistics start -object nfs_exports_access_cache", which when checked doesn't report any errors.
On the server interface:
eth1      Link encap:Ethernet  HWaddr 00:50:56:A5:0D:6A
          inet addr:10.240.1.30  Bcast:10.240.1.31  Mask:255.255.255.224
          inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:127209 errors:0 dropped:0 overruns:0 frame:0
          TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:104158360 (99.3 MiB)  TX bytes:14489402 (13.8 MiB)
While investigating we have found that the file system is fine just after a reboot and you can ls each mount, so they are initially all OK. It is when starting the application, putting a bigger load over the network, that the file systems stop responding.
Regards
Mark
I will try to find the kernel bug number, as I can't see it in the documentation for the server; there is just the following note.
RHEL 5.11 has a bug where NFS mounts mounted after network initialization at boot run with an increased number of TCP requests (approx. 10x more), which causes an RPC backlog and restricts network throughput on the NFS mounts.
To resolve this, a script has been created to restart the networking before the NFS mounts are mounted by netfs at boot. By default netfs runs at S25 on runlevels 3, 4 and 5, so we will set the NFS fix to run at S24 on the same runlevels.
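The boot-ordering fix described above can be sketched as a SysV init script whose chkconfig header orders it at S24, one slot before netfs at S25. The script name (nfs-netfix) and body are assumptions, and it is written to a local ./init.d directory here so the sketch is self-contained:

```shell
# Hypothetical sketch, not the site's actual script. The "chkconfig: 345 24 76"
# header is what puts the S24 start links on runlevels 3, 4 and 5.
mkdir -p ./init.d
cat > ./init.d/nfs-netfix <<'EOF'
#!/bin/sh
# chkconfig: 345 24 76
# description: restart networking before netfs (S25) mounts NFS filesystems
case "$1" in
  start) /sbin/service network restart ;;
  *) : ;;
esac
EOF
chmod +x ./init.d/nfs-netfix
# On a real RHEL 5 host this would be copied to /etc/init.d and
# registered with 'chkconfig --add nfs-netfix'.
```

On the real system chkconfig reads the header comment to create the S24 symlinks, which is what guarantees the ordering relative to netfs.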
PGUKCSTGCL01::*> node run -node PGUKCSTGCL01-01 -command netstat -sp tcp ---- Default IPSpace ---- tcp: 900103907 packets sent 476280230 data packets (4676048494764 bytes) 61984 data packets (82328048 bytes) retransmitted 2065 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 235945463 ack-only packets (517654 delayed) 0 URG only packets 0 window probe packets 187429557 window update packets 333130 control packets 1097649475 packets received 399065895 acks (for 4676054895668 bytes) 2174268 duplicate acks 0 acks for unsent data 723809875 packets (4886339861169 bytes) received in-sequence 1649638 completely duplicate packets (98637034 bytes) 2 old duplicate packets 990 packets with some dup. data (214519 bytes duped) 10872239 out-of-order packets (15192422547 bytes) 0 packets (0 bytes) of data after window 0 window probes 26845 window update packets 2 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 37581 discarded due to memory problems 1441 connection requests 412966 connection accepts 0 bad connection attempts 0 listen queue overflows 305109 ignored RSTs in the windows 414403 connections established (including accepts) 443890 connections closed (including 139609 drops) 151376 connections updated cached RTT on close 151388 connections updated cached RTT variance on close 140203 connections updated cached ssthresh on close 0 embryonic connections dropped 388403781 segments updated rtt (of 258539924 attempts) 6843 retransmit timeouts 11 connections dropped by rexmit timeout 3 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 92323 keepalive timeouts 92323 keepalive probes sent 0 connections dropped by keepalive 351415606 correct ACK header predictions 684179955 correct data packet header predictions 412966 syncache entries added 155 retransmitted 302 dupsyn 0 dropped 412966 completed 0 bucket 
overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 412966 cookies sent 0 cookies received 112 hostcache entries added 0 bucket overflow 16181 SACK recovery episodes 51541 segment rexmits in SACK recovery episodes 70735551 byte rexmits in SACK recovery episodes 277116 SACK options (SACK blocks) received 11457931 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 251543 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 251543 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached during receive 4 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 79 times the receive window was closed 44 dropped due to flowcontrol 188382441 segments sent using TSO 4595103991390 bytes sent using TSO 73883767 TSO segments truncated 1069 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 366670238 recv upcalls batched in HP 302647105 recv upcalls made in HP 296877004 recv upcalls made in HP because of PSH 2291336 recv upcalls made in HP because of sb_hiwat 3481239 recv upcalls made in HP because of both PSH and sb_hiwat 6733214 recv upcall batch timeouts 16594187 times recv upcall read partial sb_cc in HP 631681762 segments received using LRO 4816721023400 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- ANYVSERVER IPSpace ---- tcp: 0 packets sent 0 data packets (0 bytes) 0 data packets (0 bytes) retransmitted 0 data packets 
unnecessarily retransmitted 0 resends initiated by MTU discovery 0 ack-only packets (0 delayed) 0 URG only packets 0 window probe packets 0 window update packets 0 control packets 0 packets received 0 acks (for 0 bytes) 0 duplicate acks 0 acks for unsent data 0 packets (0 bytes) received in-sequence 0 completely duplicate packets (0 bytes) 0 old duplicate packets 0 packets with some dup. data (0 bytes duped) 0 out-of-order packets (0 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 0 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 0 connection requests 0 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 0 connections established (including accepts) 7 connections closed (including 0 drops) 0 connections updated cached RTT on close 0 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 0 embryonic connections dropped 0 segments updated rtt (of 0 attempts) 0 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 0 correct ACK header predictions 0 correct data packet header predictions 0 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 0 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 0 cookies sent 0 cookies received 0 hostcache entries added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) 
bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- Cluster IPSpace ---- tcp: 350960787 packets sent 253625385 data packets (2042642509989 bytes) 11525 data packets (120517203 bytes) retransmitted 63 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 38550609 ack-only packets (15348627 delayed) 0 URG only packets 1 window probe packet 56728197 window update packets 2035396 control packets 341097715 packets received 224460892 acks (for 2042726150883 bytes) 6840725 duplicate acks 0 acks for unsent data 271870811 packets (3031038679110 bytes) received in-sequence 195650 completely duplicate packets (4506 bytes) 49 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 205398 out-of-order packets (565766073 bytes) 0 packets (0 bytes) of data after window 0 window probes 2011210 window update packets 123 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 923539 connection requests 456892 connection accepts 0 bad connection attempts 0 listen queue overflows 529 ignored RSTs in the windows 1271558 connections established (including accepts) 1379180 connections closed (including 1101 drops) 369895 connections updated cached RTT on close 370750 connections updated cached RTT variance on close 12122 connections updated cached ssthresh on close 108207 embryonic connections dropped 224454663 segments updated rtt (of 207849890 attempts) 48471 retransmit timeouts 14 connections dropped by rexmit timeout 1 persist timeout 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 152128 keepalive timeouts 147328 keepalive probes sent 4800 connections dropped by keepalive 45057764 correct ACK header predictions 104981779 correct data packet header predictions 457057 syncache entries added 61 retransmitted 0 dupsyn 0 dropped 456892 completed 0 bucket overflow 0 cache overflow 165 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 457057 cookies sent 0 cookies received 61 hostcache entries added 0 bucket overflow 1684 SACK recovery episodes 2491 segment rexmits in SACK recovery episodes 5618157 byte rexmits in SACK recovery episodes 17518 SACK options (SACK blocks) received 86946 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero 
send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 56607835 segments sent using TSO 1679494142753 bytes sent using TSO 36473474 TSO segments truncated 394 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 4879278 recv upcalls batched in HP 90401291 recv upcalls made in HP 90401967 recv upcalls made in HP because of PSH 52 recv upcalls made in HP because of sb_hiwat 325 recv upcalls made in HP because of both PSH and sb_hiwat 32882 recv upcall batch timeouts 524 times recv upcall read partial sb_cc in HP 160827213 segments received using LRO 2789346524807 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- ips_4294967289 IPSpace ---- tcp: 0 packets sent 0 data packets (0 bytes) 0 data packets (0 bytes) retransmitted 0 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 0 ack-only packets (0 delayed) 0 URG only packets 0 window probe packets 0 window update packets 0 control packets 0 packets received 0 acks (for 0 bytes) 0 duplicate acks 0 acks for unsent data 0 packets (0 bytes) received in-sequence 0 completely duplicate packets (0 bytes) 0 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 0 out-of-order packets (0 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 0 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 0 connection requests 0 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 0 connections established (including accepts) 0 connections closed (including 0 drops) 0 connections updated cached RTT on close 0 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 0 embryonic connections dropped 0 segments updated rtt (of 0 attempts) 0 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 0 correct ACK header predictions 0 correct data packet header predictions 0 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 0 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 0 cookies sent 0 cookies received 0 hostcache entries added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached 
during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- ACP IPSpace ---- tcp: 86643 packets sent 17496 data packets (419904 bytes) 0 data packets (0 bytes) retransmitted 0 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 33848 ack-only packets (0 delayed) 0 URG only packets 0 window probe packets 23 window update packets 35276 control packets 74406 packets received 51152 acks (for 436064 bytes) 4798 duplicate acks 0 acks for unsent data 20938 packets (1251746 bytes) received in-sequence 0 completely duplicate packets (0 bytes) 0 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 0 out-of-order packets (0 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 1686 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 17605 connection requests 176 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 17672 connections established (including accepts) 17781 connections closed (including 2 drops) 0 connections updated cached RTT on close 0 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 0 embryonic connections dropped 51152 segments updated rtt (of 52750 attempts) 109 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 17474 correct ACK header predictions 4954 correct data packet header predictions 176 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 176 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 176 cookies sent 0 cookies received 0 hostcache entries added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the 
maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6
Server TCP slot entries:
[root@jwukccsbci ~]# sysctl -a | grep slot
sunrpc.tcp_slot_table_entries = 128
sunrpc.udp_slot_table_entries = 128
dev.cdrom.info = drive # of slots: 1
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com] Sent: 24 January 2018 11:53 To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Could this be TCP slot tables? Flow-control capabilities in ONTAP continue to improve, and if you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.
Complete details are in TR-3633, but these are the two that you want to watch:
[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot
sunrpc.tcp_max_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128
Newer versions of Linux allow a ridiculous number of unacknowledged RPC operations to build up. The result can be sending ONTAP into a flow-control mode until the OS catches up. We see problems mostly with slow clients. For example, if you're trying to read a lot of data on a host with 1Gb connectivity from a high-end ONTAP system, the OS can ask for data quicker than it can process the responses.
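For reference, capping the slot tables as described boils down to a sysctl setting plus, on kernels where sunrpc is a module, a module option so the cap takes effect before the first mount. The destination paths named in the comments are the usual ones; this sketch writes local files so it stands alone, and the exact recommended values should be confirmed against TR-3633:

```shell
# Sketch of capping RPC slot tables at 128. On a real host these lines
# would go into /etc/sysctl.conf and /etc/modprobe.d/sunrpc.conf.
cat > ./sysctl-sunrpc.conf <<'EOF'
sunrpc.tcp_slot_table_entries = 128
sunrpc.tcp_max_slot_table_entries = 128
EOF
# The sunrpc module can load before sysctl.conf is applied at boot, so
# the module-option form is the safer belt-and-braces setting:
echo 'options sunrpc tcp_slot_table_entries=128 tcp_max_slot_table_entries=128' \
  > ./modprobe-sunrpc.conf
cat ./sysctl-sunrpc.conf ./modprobe-sunrpc.conf
```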
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Wednesday, January 24, 2018 12:26 PM To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Justin
I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.
I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run "statistics start -object nfs_exports_access_cache", which when checked doesn't report any errors.
On the server interface
eth1      Link encap:Ethernet  HWaddr 00:50:56:A5:0D:6A
          inet addr:10.240.1.30  Bcast:10.240.1.31  Mask:255.255.255.224
          inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:127209 errors:0 dropped:0 overruns:0 frame:0
          TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:104158360 (99.3 MiB)  TX bytes:14489402 (13.8 MiB)
While investigating we have found that the file systems are fine just after a reboot, and you can ls each mount, so they are initially all OK. It is when starting the application, putting a bigger load over the network, that the file systems stop responding.
Regards
Mark
From: Parisi, Justin [mailto:Justin.Parisi@netapp.com] Sent: 23 January 2018 22:33 To: Steiner, Jeffrey; Mark Saunders; Fenn, Michael; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?
“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”
From: Parisi, Justin Sent: Tuesday, January 23, 2018 5:30 PM To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
This community post also does a good job explaining it:
https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgr...
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin Sent: Tuesday, January 23, 2018 5:28 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.
https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-...
The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey Sent: Tuesday, January 23, 2018 5:24 PM To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
I should have asked - is this SAP HANA or something like SAP on an Oracle database?
Also, what do they mean "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.
The thing about fastpath does ring a few bells.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Tuesday, January 23, 2018 11:18 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Thanks for the quick replies, and sorry for the delay in responding, but I had been working on this since 5am so had to go sleep.
I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when on 9.1 anyway).
A third party we have contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so maybe that's another false line to look down.
The mount options were kept fairly straightforward:
nfs nolock,_netdev,udp 0 0
and we have also tried the same options as one of the production servers, which had tuned options; this is on another cluster so isn't affected by this yet.
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
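As an aside on the "not responding" messages: timeo is in tenths of a second, so the retry arithmetic behind them works out roughly as below. This is a sketch, not an exact model of the kernel's backoff (which differs for UDP mounts), and retrans=2 is the usual default when unspecified, an assumption here:

```shell
# Rough retry arithmetic for the tuned mount options above.
timeo=600     # from the fstab line above; tenths of a second, so 60s
retrans=2     # common kernel default when not specified; an assumption
echo "timeout per attempt: $((timeo / 10))s"
echo "'server not responding' after roughly $((retrans + 1)) expiries"
```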
How would I be able to tell if we are using DNFS?
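For what it's worth, Oracle's usual indicators for Direct NFS are the ODM banner in the instance alert log and the v$dnfs_servers view. The log line below is simulated purely for illustration; on a real host you would grep the actual alert_<SID>.log instead:

```shell
# Simulated alert-log line; instances using dNFS log a similar ODM
# banner at startup when the Direct NFS library is linked in.
echo 'Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0' \
  > ./alert_demo.log
grep -c 'Direct NFS' ./alert_demo.log
# From SQL*Plus, active dNFS connections can also be listed with:
#   select svrname, dirname from v$dnfs_servers;
```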
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com] Sent: 23 January 2018 17:29 To: Fenn, Michael; Mark Saunders; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
What are the mount options used, and are you using DNFS?
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Mark Saunders Sent: Tuesday, January 23, 2018 4:29 PM To: toasters@teaparty.net Subject: NFS issue after upgrading filers to 9.2P2
If that's 441463, I'm skeptical that's the problem. That might cause problems during boot, but I wouldn’t expect it to cause problems later. Also, an ONTAP upgrade shouldn't affect this.
I'll subscribe to the case and follow along. The stats below do show some possible problems. There was some flow control activity, and the SACK numbers look high to me.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Wednesday, January 24, 2018 1:02 PM To: Steiner, Jeffrey Jeffrey.Steiner@netapp.com; Parisi, Justin Justin.Parisi@netapp.com; Fenn, Michael fennm@DEShawResearch.com; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
PGUKCSTGCL01::*> node run -node PGUKCSTGCL01-01 -command netstat -sp tcp ---- Default IPSpace ---- tcp: 900103907 packets sent 476280230 data packets (4676048494764 bytes) 61984 data packets (82328048 bytes) retransmitted 2065 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 235945463 ack-only packets (517654 delayed) 0 URG only packets 0 window probe packets 187429557 window update packets 333130 control packets 1097649475 packets received 399065895 acks (for 4676054895668 bytes) 2174268 duplicate acks 0 acks for unsent data 723809875 packets (4886339861169 bytes) received in-sequence 1649638 completely duplicate packets (98637034 bytes) 2 old duplicate packets 990 packets with some dup. data (214519 bytes duped) 10872239 out-of-order packets (15192422547 bytes) 0 packets (0 bytes) of data after window 0 window probes 26845 window update packets 2 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 37581 discarded due to memory problems 1441 connection requests 412966 connection accepts 0 bad connection attempts 0 listen queue overflows 305109 ignored RSTs in the windows 414403 connections established (including accepts) 443890 connections closed (including 139609 drops) 151376 connections updated cached RTT on close 151388 connections updated cached RTT variance on close 140203 connections updated cached ssthresh on close 0 embryonic connections dropped 388403781 segments updated rtt (of 258539924 attempts) 6843 retransmit timeouts 11 connections dropped by rexmit timeout 3 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 92323 keepalive timeouts 92323 keepalive probes sent 0 connections dropped by keepalive 351415606 correct ACK header predictions 684179955 correct data packet header predictions 412966 syncache entries added 155 retransmitted 302 dupsyn 0 dropped 412966 completed 0 bucket 
overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 412966 cookies sent 0 cookies received 112 hostcache entries added 0 bucket overflow 16181 SACK recovery episodes 51541 segment rexmits in SACK recovery episodes 70735551 byte rexmits in SACK recovery episodes 277116 SACK options (SACK blocks) received 11457931 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 251543 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 251543 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached during receive 4 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 79 times the receive window was closed 44 dropped due to flowcontrol 188382441 segments sent using TSO 4595103991390 bytes sent using TSO 73883767 TSO segments truncated 1069 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 366670238 recv upcalls batched in HP 302647105 recv upcalls made in HP 296877004 recv upcalls made in HP because of PSH 2291336 recv upcalls made in HP because of sb_hiwat 3481239 recv upcalls made in HP because of both PSH and sb_hiwat 6733214 recv upcall batch timeouts 16594187 times recv upcall read partial sb_cc in HP 631681762 segments received using LRO 4816721023400 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- ANYVSERVER IPSpace ---- tcp: 0 packets sent 0 data packets (0 bytes) 0 data packets (0 bytes) retransmitted 0 data packets 
unnecessarily retransmitted 0 resends initiated by MTU discovery 0 ack-only packets (0 delayed) 0 URG only packets 0 window probe packets 0 window update packets 0 control packets 0 packets received 0 acks (for 0 bytes) 0 duplicate acks 0 acks for unsent data 0 packets (0 bytes) received in-sequence 0 completely duplicate packets (0 bytes) 0 old duplicate packets 0 packets with some dup. data (0 bytes duped) 0 out-of-order packets (0 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 0 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 0 connection requests 0 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 0 connections established (including accepts) 7 connections closed (including 0 drops) 0 connections updated cached RTT on close 0 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 0 embryonic connections dropped 0 segments updated rtt (of 0 attempts) 0 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 0 correct ACK header predictions 0 correct data packet header predictions 0 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 0 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 0 cookies sent 0 cookies received 0 hostcache entries added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) 
bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- Cluster IPSpace ---- tcp: 350960787 packets sent 253625385 data packets (2042642509989 bytes) 11525 data packets (120517203 bytes) retransmitted 63 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 38550609 ack-only packets (15348627 delayed) 0 URG only packets 1 window probe packet 56728197 window update packets 2035396 control packets 341097715 packets received 224460892 acks (for 2042726150883 bytes) 6840725 duplicate acks 0 acks for unsent data 271870811 packets (3031038679110 bytes) received in-sequence 195650 completely duplicate packets (4506 bytes) 49 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 205398 out-of-order packets (565766073 bytes) 0 packets (0 bytes) of data after window 0 window probes 2011210 window update packets 123 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 923539 connection requests 456892 connection accepts 0 bad connection attempts 0 listen queue overflows 529 ignored RSTs in the windows 1271558 connections established (including accepts) 1379180 connections closed (including 1101 drops) 369895 connections updated cached RTT on close 370750 connections updated cached RTT variance on close 12122 connections updated cached ssthresh on close 108207 embryonic connections dropped 224454663 segments updated rtt (of 207849890 attempts) 48471 retransmit timeouts 14 connections dropped by rexmit timeout 1 persist timeout 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 152128 keepalive timeouts 147328 keepalive probes sent 4800 connections dropped by keepalive 45057764 correct ACK header predictions 104981779 correct data packet header predictions 457057 syncache entries added 61 retransmitted 0 dupsyn 0 dropped 456892 completed 0 bucket overflow 0 cache overflow 165 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 457057 cookies sent 0 cookies received 61 hostcache entries added 0 bucket overflow 1684 SACK recovery episodes 2491 segment rexmits in SACK recovery episodes 5618157 byte rexmits in SACK recovery episodes 17518 SACK options (SACK blocks) received 86946 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero 
send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 56607835 segments sent using TSO 1679494142753 bytes sent using TSO 36473474 TSO segments truncated 394 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 4879278 recv upcalls batched in HP 90401291 recv upcalls made in HP 90401967 recv upcalls made in HP because of PSH 52 recv upcalls made in HP because of sb_hiwat 325 recv upcalls made in HP because of both PSH and sb_hiwat 32882 recv upcall batch timeouts 524 times recv upcall read partial sb_cc in HP 160827213 segments received using LRO 2789346524807 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- ips_4294967289 IPSpace ---- tcp: 0 packets sent 0 data packets (0 bytes) 0 data packets (0 bytes) retransmitted 0 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 0 ack-only packets (0 delayed) 0 URG only packets 0 window probe packets 0 window update packets 0 control packets 0 packets received 0 acks (for 0 bytes) 0 duplicate acks 0 acks for unsent data 0 packets (0 bytes) received in-sequence 0 completely duplicate packets (0 bytes) 0 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 0 out-of-order packets (0 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 0 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 0 connection requests 0 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 0 connections established (including accepts) 0 connections closed (including 0 drops) 0 connections updated cached RTT on close 0 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 0 embryonic connections dropped 0 segments updated rtt (of 0 attempts) 0 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 0 correct ACK header predictions 0 correct data packet header predictions 0 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 0 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 0 cookies sent 0 cookies received 0 hostcache entries added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached 
during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- ACP IPSpace ---- tcp: 86643 packets sent 17496 data packets (419904 bytes) 0 data packets (0 bytes) retransmitted 0 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 33848 ack-only packets (0 delayed) 0 URG only packets 0 window probe packets 23 window update packets 35276 control packets 74406 packets received 51152 acks (for 436064 bytes) 4798 duplicate acks 0 acks for unsent data 20938 packets (1251746 bytes) received in-sequence 0 completely duplicate packets (0 bytes) 0 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 0 out-of-order packets (0 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 1686 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 17605 connection requests 176 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 17672 connections established (including accepts) 17781 connections closed (including 2 drops) 0 connections updated cached RTT on close 0 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 0 embryonic connections dropped 51152 segments updated rtt (of 52750 attempts) 109 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 17474 correct ACK header predictions 4954 correct data packet header predictions 176 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 176 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 176 cookies sent 0 cookies received 0 hostcache entries added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the 
maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6
Server TCP slot-table entries:
[root@jwukccsbci ~]# sysctl -a | grep slot sunrpc.tcp_slot_table_entries = 128 sunrpc.udp_slot_table_entries = 128 dev.cdrom.info = drive # of slots: 1
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com] Sent: 24 January 2018 11:53 To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.
Complete details are in TR-3633, but these are the two that you want to watch:
[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot sunrpc.tcp_max_slot_table_entries = 128 sunrpc.tcp_slot_table_entries = 128
Newer versions of Linux will allow a ridiculous number of unacknowledged RPC operations to build up, which can push ONTAP into a flow control mode until the OS catches up. We see problems mostly with slow clients. For example, if you're trying to read a lot of data from a host with 1Gb connectivity on a high-end ONTAP system, the OS can ask for data quicker than it can process the responses.
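For anyone wanting to pin the limit, a sketch of the usual approach (check TR-3633 for the exact guidance for your distro; the local filename here is illustrative) is a sunrpc module option plus a runtime sysctl:

```shell
# Sketch: cap RPC slot tables at 128 so the client can't queue an
# unbounded number of unacknowledged RPCs against the filer.
# Written locally here; on a real system this would go to /etc/modprobe.d/.
cat > ./sunrpc.conf <<'EOF'
options sunrpc tcp_slot_table_entries=128
options sunrpc tcp_max_slot_table_entries=128
EOF
# On a live system, also apply immediately:
#   sysctl -w sunrpc.tcp_slot_table_entries=128
#   sysctl -w sunrpc.tcp_max_slot_table_entries=128
grep -c 'slot_table_entries=128' ./sunrpc.conf
```

The modprobe file makes the cap survive reboots; the sysctl calls apply it without remounting.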
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Wednesday, January 24, 2018 12:26 PM To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Justin
I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.
I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run "statistics start -object nfs_exports_access_cache", which when checked doesn't report any errors.
On the server interface
eth1 Link encap:Ethernet HWaddr 00:50:56:A5:0D:6A inet addr:10.240.1.30 Bcast:10.240.1.31 Mask:255.255.255.224 inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:127209 errors:0 dropped:0 overruns:0 frame:0 TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:104158360 (99.3 MiB) TX bytes:14489402 (13.8 MiB)
While investigating we have found that the file systems are fine just after a reboot and you can ls each mount, so they are initially all OK. It is when the application starts, putting a bigger load over the network, that the file systems stop responding.
Regards
Mark
From: Parisi, Justin [mailto:Justin.Parisi@netapp.com] Sent: 23 January 2018 22:33 To: Steiner, Jeffrey; Mark Saunders; Fenn, Michael; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?
“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”
From: Parisi, Justin Sent: Tuesday, January 23, 2018 5:30 PM To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
This community post also does a good job explaining it:
https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgr...
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin Sent: Tuesday, January 23, 2018 5:28 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.
https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-...
The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey Sent: Tuesday, January 23, 2018 5:24 PM To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
I should have asked - is this SAP HANA or something like SAP on an Oracle database?
Also, what do they mean "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.
The thing about fastpath does ring a few bells.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com] Sent: Tuesday, January 23, 2018 11:18 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
Thanks for the quick replies, and sorry for the delay in responding, but I had been working on this since 5am so had to go sleep.
I have a call open with NetApp, but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when we were on 9.1, either).
A third party we have contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so that may be another false lead.
The mount options were kept fairly straightforward:
nfs nolock,_netdev,udp 0 0
and we have also tried the same as one of the production servers, which had tuned options; that server is on another cluster, so it isn't affected by this yet.
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
How would I be able to tell if we are using DNFS?
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com] Sent: 23 January 2018 17:29 To: Fenn, Michael; Mark Saunders; toasters@teaparty.net Subject: RE: NFS issue after upgrading filers to 9.2P2
It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.
I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.
From: Fenn, Michael [mailto:fennm@DEShawResearch.com] Sent: Tuesday, January 23, 2018 6:23 PM To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net Subject: Re: NFS issue after upgrading filers to 9.2P2
The messages are not necessarily indicative of a network problem.
The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs: server … OK".
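For reference, a rough sketch of that arithmetic, assuming the classic doubling back-off that nfs(5) describes for UDP mounts (TCP mounts behave differently, and exact behavior varies by kernel):

```python
def major_timeout_s(timeo_ds: int, retrans: int) -> float:
    """Approximate seconds before the kernel logs
    'nfs: server ... not responding' on a UDP mount:
    each retransmission waits twice as long as the previous one."""
    wait = timeo_ds / 10.0  # timeo is given in deciseconds
    total = 0.0
    for _ in range(retrans + 1):  # initial try plus `retrans` retries
        total += wait
        wait *= 2
    return total

# Historical UDP defaults timeo=11, retrans=3 give roughly 16.5 s;
# the tuned timeo=600, retrans=2 quoted later in the thread gives 420 s.
print(major_timeout_s(11, 3))
print(major_timeout_s(600, 2))
```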
Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
Thanks, Michael
From: <toasters-bounces@teaparty.net> on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.com> Date: Tuesday, January 23, 2018 at 10:38 AM To: Mark Saunders <Mark.Saunders@pcmsgroup.com>, "toasters@teaparty.net" <toasters@teaparty.net> Subject: RE: NFS issue after upgrading filers to 9.2P2
Those messages are indicative of a network problem. The packets are going through, then they succeed when the NFS client retries, then they pause again.
I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of a locking, firewall, or general configuration problem you should have no access whatsoever.
I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.
What are the mount options used, and are you using DNFS?
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Mark Saunders Sent: Tuesday, January 23, 2018 4:29 PM To: toasters@teaparty.net Subject: NFS issue after upgrading filers to 9.2P2
Hi gents, today we upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the SAP database servers' NFS mounts. When a server is bounced the mounts attach with no problems, but after a few minutes a df -h becomes very slow reporting the NFS-mounted directories, and if the databases are started up they hang, and a df -h then also hangs. This sometimes recovers enough to allow a df -h to work again, but the databases are a lost cause right now.
In the server messages we get lots of entries for the SVM:
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying Jan 23 07:01:47 jwukccsbci last message repeated 5 times Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK Jan 23 07:02:07 jwukccsbci last message repeated 5 times Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying Jan 23 07:02:47 jwukccsbci last message repeated 5 times Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
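Those "not responding"/"OK" pairs can be mined for stall lengths; here's a small hypothetical helper (the log lines are trimmed from the excerpt above, dropping the "last message repeated" lines):

```python
from datetime import datetime

LOG = """\
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
"""

def stall_durations(log: str, server: str) -> list:
    """Pair each 'not responding' line with the following 'OK' line
    for the given server and return stall lengths in seconds."""
    start = None
    stalls = []
    for line in log.splitlines():
        # syslog timestamps are the first three whitespace-separated fields
        ts = datetime.strptime(" ".join(line.split()[:3]), "%b %d %H:%M:%S")
        if f"server {server} not responding" in line:
            start = ts
        elif f"server {server} OK" in line and start is not None:
            stalls.append(int((ts - start).total_seconds()))
            start = None
    return stalls

print(stall_durations(LOG, "JWUKCSVM01"))  # [40, 21]
```

Running this over a full day's messages file would show whether the stalls cluster around application startup, as described later in the thread.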
Is there anything that would have changed in the upgrade to lock down NFS, or changed options that we might need to change back?
The Red Hat servers are on an old kernel version, 2.6.18-371.el5, that has some bugs, but this was working fine before the filer upgrade was carried out.
Regards Mark Data Centre Sysadmin Team Managed Services Phone:- 02476 694455 Ext 2567
After a bit of an email search, it was this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=321111
bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- Cluster IPSpace ---- tcp: 350960787 packets sent 253625385 data packets (2042642509989 bytes) 11525 data packets (120517203 bytes) retransmitted 63 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 38550609 ack-only packets (15348627 delayed) 0 URG only packets 1 window probe packet 56728197 window update packets 2035396 control packets 341097715 packets received 224460892 acks (for 2042726150883 bytes) 6840725 duplicate acks 0 acks for unsent data 271870811 packets (3031038679110 bytes) received in-sequence 195650 completely duplicate packets (4506 bytes) 49 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 205398 out-of-order packets (565766073 bytes) 0 packets (0 bytes) of data after window 0 window probes 2011210 window update packets 123 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 923539 connection requests 456892 connection accepts 0 bad connection attempts 0 listen queue overflows 529 ignored RSTs in the windows 1271558 connections established (including accepts) 1379180 connections closed (including 1101 drops) 369895 connections updated cached RTT on close 370750 connections updated cached RTT variance on close 12122 connections updated cached ssthresh on close 108207 embryonic connections dropped 224454663 segments updated rtt (of 207849890 attempts) 48471 retransmit timeouts 14 connections dropped by rexmit timeout 1 persist timeout 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 152128 keepalive timeouts 147328 keepalive probes sent 4800 connections dropped by keepalive 45057764 correct ACK header predictions 104981779 correct data packet header predictions 457057 syncache entries added 61 retransmitted 0 dupsyn 0 dropped 456892 completed 0 bucket overflow 0 cache overflow 165 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 457057 cookies sent 0 cookies received 61 hostcache entries added 0 bucket overflow 1684 SACK recovery episodes 2491 segment rexmits in SACK recovery episodes 5618157 byte rexmits in SACK recovery episodes 17518 SACK options (SACK blocks) received 86946 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero 
send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 56607835 segments sent using TSO 1679494142753 bytes sent using TSO 36473474 TSO segments truncated 394 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 4879278 recv upcalls batched in HP 90401291 recv upcalls made in HP 90401967 recv upcalls made in HP because of PSH 52 recv upcalls made in HP because of sb_hiwat 325 recv upcalls made in HP because of both PSH and sb_hiwat 32882 recv upcall batch timeouts 524 times recv upcall read partial sb_cc in HP 160827213 segments received using LRO 2789346524807 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- ips_4294967289 IPSpace ---- tcp: 0 packets sent 0 data packets (0 bytes) 0 data packets (0 bytes) retransmitted 0 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 0 ack-only packets (0 delayed) 0 URG only packets 0 window probe packets 0 window update packets 0 control packets 0 packets received 0 acks (for 0 bytes) 0 duplicate acks 0 acks for unsent data 0 packets (0 bytes) received in-sequence 0 completely duplicate packets (0 bytes) 0 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 0 out-of-order packets (0 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 0 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 0 connection requests 0 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 0 connections established (including accepts) 0 connections closed (including 0 drops) 0 connections updated cached RTT on close 0 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 0 embryonic connections dropped 0 segments updated rtt (of 0 attempts) 0 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 0 correct ACK header predictions 0 correct data packet header predictions 0 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 0 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 0 cookies sent 0 cookies received 0 hostcache entries added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the maximum flow control reset threshold reached 
during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6 ---- ACP IPSpace ---- tcp: 86643 packets sent 17496 data packets (419904 bytes) 0 data packets (0 bytes) retransmitted 0 data packets unnecessarily retransmitted 0 resends initiated by MTU discovery 33848 ack-only packets (0 delayed) 0 URG only packets 0 window probe packets 23 window update packets 35276 control packets 74406 packets received 51152 acks (for 436064 bytes) 4798 duplicate acks 0 acks for unsent data 20938 packets (1251746 bytes) received in-sequence 0 completely duplicate packets (0 bytes) 0 old duplicate packets 0 packets with some dup. 
data (0 bytes duped) 0 out-of-order packets (0 bytes) 0 packets (0 bytes) of data after window 0 window probes 0 window update packets 1686 packets received after close 0 discarded for bad checksums 0 discarded for bad header offset fields 0 discarded because packet too short 0 discarded due to memory problems 17605 connection requests 176 connection accepts 0 bad connection attempts 0 listen queue overflows 0 ignored RSTs in the windows 17672 connections established (including accepts) 17781 connections closed (including 2 drops) 0 connections updated cached RTT on close 0 connections updated cached RTT variance on close 0 connections updated cached ssthresh on close 0 embryonic connections dropped 51152 segments updated rtt (of 52750 attempts) 109 retransmit timeouts 0 connections dropped by rexmit timeout 0 persist timeouts 0 connections dropped by persist timeout 0 Connections (fin_wait_2) dropped because of timeout 0 keepalive timeouts 0 keepalive probes sent 0 connections dropped by keepalive 17474 correct ACK header predictions 4954 correct data packet header predictions 176 syncache entries added 0 retransmitted 0 dupsyn 0 dropped 176 completed 0 bucket overflow 0 cache overflow 0 reset 0 stale 0 aborted 0 badack 0 unreach 0 zone failures 176 cookies sent 0 cookies received 0 hostcache entries added 0 bucket overflow 0 SACK recovery episodes 0 segment rexmits in SACK recovery episodes 0 byte rexmits in SACK recovery episodes 0 SACK options (SACK blocks) received 0 SACK options (SACK blocks) sent 0 SACK scoreboard overflow 0 packets with ECN CE bit set 0 packets with ECN ECT(0) bit set 0 packets with ECN ECT(1) bit set 0 successful ECN handshakes 0 times ECN reduced the congestion window 0 times in ONTAP flow control 0 times exited ONTAP flow control 0 times in ONTAP flow control for zero send window 0 times in ONTAP flow control for non-zero send window 0 connection resets due to ONTAP extreme flow control 0 times in ONTAP extreme flow control 0 is the 
maximum flow control reset threshold reached during receive 0 is the maximum flow control reset threshold reached during send 0 bytes is send buffer value during last reset 0 bytes is send buffer hiwat mark during last reset 0 times the receive window was closed 0 dropped due to flowcontrol 0 segments sent using TSO 0 bytes sent using TSO 0 TSO segments truncated 0 TSO wrapped sequence space segments 0 segments sent using TSO6 0 bytes sent using TSO6 0 TSO6 segments truncated 0 TSO6 wrapped sequence space segments 0 recv upcalls batched in HP 0 recv upcalls made in HP 0 recv upcalls made in HP because of PSH 0 recv upcalls made in HP because of sb_hiwat 0 recv upcalls made in HP because of both PSH and sb_hiwat 0 recv upcall batch timeouts 0 times recv upcall read partial sb_cc in HP 0 segments received using LRO 0 bytes received using LRO 0 segments received using LRO6 0 bytes received using LRO6
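[Editorial aside: the counters in the dump above that bear on this thread are the Default IPSpace retransmits relative to data packets sent, and the 251543 entries into ONTAP flow control. A quick arithmetic check on the retransmit rate, using only the figures already shown:]

```shell
# Retransmit rate for the Default IPSpace, from the counters above:
# 61984 data packets retransmitted out of 476280230 data packets sent.
awk 'BEGIN { printf "retransmit rate: %.3f%%\n", 100 * 61984 / 476280230 }'
```

That works out to roughly 0.013%, which is low in absolute terms; the flow-control counter is the one that appears to tie in with the slot-table discussion elsewhere in the thread.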
Server tcp entries
[root@jwukccsbci ~]# sysctl -a | grep slot
sunrpc.tcp_slot_table_entries = 128
sunrpc.udp_slot_table_entries = 128
dev.cdrom.info = drive # of slots: 1
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 11:53
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.
Complete details are in TR-3633, but these are the two that you want to watch:
[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot
sunrpc.tcp_max_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128
Newer versions of Linux will allow a ridiculous number of unacknowledged RPC operations to build up. The result can be sending ONTAP into a flow-control mode until the OS catches up. We see problems mostly with slow clients. For example, if you're trying to read a lot of data from a host with 1Gb connectivity on a high-end ONTAP system, the OS can ask for data faster than it can process the responses.
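[Editorial aside: if the slot tables do need capping, the usual fix per the TR-3633 guidance looks like the following. This is a sketch: the modprobe.d path is typical for RHEL-family systems, and note that sunrpc.tcp_max_slot_table_entries only exists on newer kernels, so the 2.6.18 client in this thread would only have the first knob.]

```shell
# Persistent cap on the RPC slot tables, applied when the sunrpc
# module loads (path is the usual RHEL-family location -- verify
# for your distro):
cat > /etc/modprobe.d/sunrpc.conf <<'EOF'
options sunrpc tcp_slot_table_entries=128
options sunrpc tcp_max_slot_table_entries=128
EOF
# Apply on the running system without a reboot:
sysctl -w sunrpc.tcp_slot_table_entries=128
sysctl -w sunrpc.tcp_max_slot_table_entries=128
```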
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 12:26 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
Justin
I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.
I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run "statistics start -object nfs_exports_access_cache", which, when checked, doesn't report any errors.
On the server interface
eth1      Link encap:Ethernet  HWaddr 00:50:56:A5:0D:6A
          inet addr:10.240.1.30  Bcast:10.240.1.31  Mask:255.255.255.224
          inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:127209 errors:0 dropped:0 overruns:0 frame:0
          TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:104158360 (99.3 MiB)  TX bytes:14489402 (13.8 MiB)
While investigating we have found that the file system is fine just after a reboot: you can ls each mount, so they are initially all OK. It is when starting the application, and so putting a bigger load over the network, that the file systems stop responding.
Regards
Mark
From: Parisi, Justin [mailto:Justin.Parisi@netapp.com]
Sent: 23 January 2018 22:33
To: Steiner, Jeffrey; Mark Saunders; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?
“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”
From: Parisi, Justin
Sent: Tuesday, January 23, 2018 5:30 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
This community post also does a good job explaining it:
https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgr...
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin
Sent: Tuesday, January 23, 2018 5:28 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.
https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-...
The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey
Sent: Tuesday, January 23, 2018 5:24 PM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
I should have asked - is this SAP HANA or something like SAP on an Oracle database?
Also, what do they mean "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.
The thing about fastpath does ring a few bells.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Tuesday, January 23, 2018 11:18 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
Thanks for the quick replies, and sorry for the delay in responding, but I had been working on this since 5 am so had to get some sleep.
I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when we were on 9.1 anyway).
A third party we have contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so that is maybe another false line to look down.
The mount options were kept fairly straightforward as
nfs nolock,_netdev,udp 0 0
and we have also tried the same options as one of the production servers, which had tuned options; this is on another cluster so isn't affected by this yet.
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
How would I be able to tell if we are using DNFS?
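[Editorial aside: the thread never answers this, so for reference, one way to check, assuming an Oracle database. The paths and SID layout below are assumptions; adjust to your install.]

```shell
# 1) Oracle Direct NFS announces itself in the alert log at instance
#    startup, so grep for it (diag path is an assumption):
grep -i "direct nfs" "$ORACLE_BASE"/diag/rdbms/*/*/trace/alert_*.log
# 2) Or query the DNFS views: rows in v$dnfs_servers mean DNFS is
#    active; no rows means the kernel NFS client is being used.
sqlplus -s / as sysdba <<'EOF'
SELECT svrname, dirname FROM v$dnfs_servers;
EOF
```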
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 23 January 2018 17:29
To: Fenn, Michael; Mark Saunders; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.
I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.
From: Fenn, Michael [mailto:fennm@DEShawResearch.com]
Sent: Tuesday, January 23, 2018 6:23 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net
Subject: Re: NFS issue after upgrading filers to 9.2P2
The messages are not necessarily indicative of a network problem.
The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs: server … OK".
Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
Thanks, Michael
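[Editorial aside: to put rough numbers on the timeo/retrans behaviour described above. This is a simplified model with example values; the real kernel clamps the backoff and handles TCP and UDP differently, so treat it as an illustration only.]

```shell
timeo=600   # deciseconds, i.e. 60 s (example value)
retrans=2   # number of retries (example value)
awk -v t="$timeo" -v r="$retrans" 'BEGIN {
  total = 0; cur = t / 10            # convert deciseconds to seconds
  for (i = 0; i <= r; i++) {         # initial attempt plus r retries
    total += cur; cur *= 2           # timeout doubles on each retry
  }
  printf "~%.1f s before \"not responding\" is logged\n", total
}'
# prints: ~420.0 s before "not responding" is logged
```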
From: <toasters-bounces@teaparty.net> on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.com>
Date: Tuesday, January 23, 2018 at 10:38 AM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>, "toasters@teaparty.net" <toasters@teaparty.net>
Subject: RE: NFS issue after upgrading filers to 9.2P2
Those messages are indicative of a network problem. The packets are going through, then they succeed when the NFS client retries, then they pause again.
I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of a locking, firewall, or general configuration problem you should have no access whatsoever.
I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.
What are the mount options used, and are you using DNFS?
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Mark Saunders
Sent: Tuesday, January 23, 2018 4:29 PM
To: toasters@teaparty.net
Subject: NFS issue after upgrading filers to 9.2P2
Hi gents, today we have upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the SAP database servers' NFS mounts. When a server is bounced the mounts are attached with no problems, but after a few minutes a df -h starts to be very slow reporting the NFS-mounted directories, and if the databases are started up they hang, after which a df -h also hangs. This sometimes recovers enough to allow a df -h to work again, but the databases are a lost cause right now.
In the server messages we get lots of entries for the SVM
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:01:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:07 jwukccsbci last message repeated 5 times
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Is there anything that would have changed in the upgrade to lock down NFS, or changed options that we might need to change back?
The Red Hat servers run an old kernel version, 2.6.18-371.el5, that has some bugs, but this was working fine before the filer upgrade was carried out.
Regards
Mark
Data Centre Sysadmin Team, Managed Services
Phone: 02476 694455 Ext 2567
Well, as is always the way with these random issues, it is the simplest thing that fixes the problem. After a test mount of the storage on one of the SLES servers worked fine, we changed the Red Hat server mounts to tcp, as that was what we used for the SLES mounts, and the file system is fine; all databases have started up and been running for a number of hours with no problems.
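[Editorial aside: the fix amounts to swapping the transport option in fstab. A hypothetical line for reference — the export path and mount point are invented for illustration, and the other options mirror the tuned set quoted earlier in the thread:]

```
jwukcsvm01:/vol/sapdata  /oracle/data  nfs  nfsvers=3,nolock,_netdev,rw,tcp,rsize=32768,wsize=32768,timeo=600  0  0
```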
I have chased down the person who set the servers up; he was trying different options to see what gave the best performance and left udp in the options, as while he wasn't sure it was any better, it hadn't got any worse.
Thank you to everyone for the quick replies; if I had been waiting on the ticket I logged I would be no further forward than Tuesday morning.
-----Original Message-----
From: John Stoffel [mailto:john@stoffel.org]
Sent: 24 January 2018 17:20
To: Steiner, Jeffrey
Cc: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
Jeffrey> Could this be TCP slot tables? Flow control capabilities on
Jeffrey> ONTAP continue to improve. If you don't have TCP slot tables
Jeffrey> capped at 128 you could see quasi-hangs like this.
But the problem was with UDP NFS traffic, right?
I've run into weird problems in the past. I think we had a problem where the volumes holding the Oracle tablespaces were mounted with "forcedirectio", but if we had any executables on there, they just wouldn't work and we'd have all kinds of problems.
Maybe it's something like that?
And just to confirm, other clients running newer RHEL versions, or SLES, using the exact same interfaces/IPs/mountpoint from the NetApp cluster don't show the problem?
Just because VMware and SLES 11/12 aren't seeing the problem doesn't mean you don't have some weird configuration issue somewhere. What does "config advisor" say when run against your cluster?
Remember, change just one thing at a time, otherwise you're going to go mad. Of course I'm sure the business is jumping up and down screaming, which makes it hard to be methodical.
Good luck and let us know what you find please!
John
Mark> Well as is always the way with these random issues it is the
Mark> simplest thing to fix the problem. After a test to mount the
Mark> storage on one of the SLES servers which worked fine we have
Mark> changed the redhat server mounts to tcp as that was what we used
Mark> for the SLES mounts and the file system is fine and all
Mark> databases have started up and been running for a number of hours
Mark> with no problems.
That's good to hear! The takeaway I have from this is that NFS over UDP is not something you should ever be using.
Mark> I have chased down the person who set the servers up and he was
Mark> trying different options to see what gave the best performance
Mark> and left udp in the options as while he wasn't sure that it was
Mark> any better it hadn't got any worse.
I'm curious how they did their testing, what the cutoff was for making changes, and whether it was worth keeping. All the docs I've read from NetApp and Oracle say to use NFS over TCP, with large read/write block sizes and some other options in special cases.
In my mind, the advantages of TCP over UDP even for regular NFS traffic make it a no brainer.
Mark> Thank you to everyone for the quick replies; if I had been waiting on the ticket I logged, I would be no further forward than on Tuesday morning.
This is why I love this mailing list, so many helpful people on here.
John
On Thu, Jan 25, 2018 at 08:44:00AM -0500, John Stoffel wrote:
That's good to hear! The takeaway I have from this is that NFS over UDP is not something you should ever be using.
Mark> I have chased down the person who set the servers up, and he was trying different options to see what gave the best performance, and left udp in the options as, while he wasn't sure that it was any better, it hadn't got any worse.
I'm curious how they did their testing, what the cutoff was for making changes, and whether it was worth keeping. All the docs I've read from NetApp and Oracle say to use NFS over TCP, with large read/write block sizes and some other options in special cases.
In my mind, the advantages of TCP over UDP even for regular NFS traffic make it a no brainer.
Hmmmm..... disagree.
In the best of all possible worlds UDP wins. It's fast, and you can overlap multiple reads and writes much more easily than with TCP. Those guys who invented NFS used it for a reason. If I wanted raw performance I would use UDP.
However, in lots of cases UDP has problems. Network devices are often optimized for TCP (firewalls are a prime example) and as you say packet sizes can be larger with TCP.
I agree that TCP is a better bet in general but I do understand why people may want to use UDP.
It's interesting that the new data transfer algorithms seem to be UDP based, Aspera for example. I wonder if those types of protocols could make sense for NFS?
Regards, pdg
"Peter" == Peter D Gray pdg@uow.edu.au writes:
Peter> On Thu, Jan 25, 2018 at 08:44:00AM -0500, John Stoffel wrote:
Peter> Hmmmm..... disagree.
Peter> In the best of all possible worlds UDP wins. It's fast, and you can overlap multiple reads and writes much more easily than with TCP. Those guys who invented NFS used it for a reason. If I wanted raw performance I would use UDP.
They used UDP at the time because computers and networks were *slow* and the TCP overhead was much higher then, especially since they mostly had hubs back then. Under contention, NFS over TCP would slow way down. I would argue that this is a false economy today when we have 10g networks. *grin*
Peter> However, in lots of cases UDP has problems. Network devices are often optimized for TCP (firewalls are a prime example) and as you say packet sizes can be larger with TCP.
Exactly. Chasing a few percent of speed (or even 10%!) by using UDP is not a great idea, especially since I suspect that NFS over UDP is a much less tested version of the protocol these days.
Peter> I agree that TCP is a better bet in general, but I do understand why people may want to use UDP.
Peter> It's interesting that the new data transfer algorithms seem to be UDP based, Aspera for example. I wonder if those types of protocols could make sense for NFS?
If you're willing to do your own congestion control and packet handling, then sure, it can make sense, especially if you're working over a WAN link, don't mind out-of-order packets, and can handle it better in your own server software. But how many filesystems are doing this? Especially with POSIX compatibility?
John
On 30 Jan 2018, at 7:57 am, John Stoffel <john@stoffel.org> wrote:
They used UDP at the time because computers and networks were *slow* and the TCP overhead was much higher then, especially since they mostly had hubs back then. Under contention, NFS over TCP would slow way down. I would argue that this is a false economy today when we have 10g networks. *grin*
Indeed. NFS (over UDP) was invented when Ethernet meant a thick coaxial cable running around the building, shared between all machines, and the speed was 10Mb/s (that’s bits not bytes). Processor speeds were typically 10-20MHz in high end servers. I was around then.
Nowadays, network hardware is all optimised for TCP, whereas there is not much you can do with UDP without being aware of the application layer (7).
TCP offload engines in the network interfaces handle packet assembly/disassembly, checksum computations and other things. Network switches and routers can optimise TCP traffic.
UDP still has its place for things like VPN, media streaming and specialised applications like Aspera. But it doesn’t make sense for standard applications and certainly not file sharing, where data integrity is paramount.
Jeremy
-- Jeremy Webber Senior Systems Engineer
T: +61 2 9383 4800 (main) D: +61 2 8310 3577 (direct) E: Jeremy.Webber@al.com.au
Building 54 / FSA #19, Fox Studios Australia, 38 Driver Avenue Moore Park, NSW 2021 AUSTRALIA
Oracle DNFS requires extra setup and bypasses the system NFS client. Oracle built their own client that is pretty efficient.
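One quick way to answer the "are we using DNFS?" question is to look for the dNFS banner Oracle writes to the instance alert log at startup. A sketch of the check, assuming a typical banner wording (it varies by Oracle version, and the real alert log path depends on your ORACLE_BASE and SID; a demo file stands in for it here):

```shell
# dNFS announces itself in the instance alert log at startup. The line below
# is a typical banner; we write it to a demo file just to illustrate the check.
printf 'Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 3.0\n' > /tmp/alert_demo.log

# The real check would grep your actual alert_<SID>.log instead:
if grep -qi 'direct nfs' /tmp/alert_demo.log; then
  echo "dNFS in use"
else
  echo "no dNFS banner found"
fi
# prints: dNFS in use
```

If the mounts in question never show that banner, the database is going through the kernel NFS client and the system mount options apply.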
Just out of curiosity, take a look at your Pause Frames (Xon/Xoff). Look on the NetApp side and the client side. (netapp -> system node run -node node-0x ifconfig e0a) Client side will depend, but you want to see the eth stats. If it is possible, even check the ports on the switch.
Maybe they are being generated more frequently after the upgrade?
As Justin suggested, looking at a packet trace from both ends would be helpful.
--tmac
*Tim McCarthy, **Principal Consultant*
*Proud Member of the #NetAppATeam https://twitter.com/NetAppATeam*
*I Blog at TMACsRack https://tmacsrack.wordpress.com/*
On Tue, Jan 23, 2018 at 5:17 PM, Mark Saunders <Mark.Saunders@pcmsgroup.com> wrote:
Thanks for the quick replies, and sorry for the delay in responding, but I had been working on this since 5am so had to get some sleep.
I have a call open with NetApp but have had the cookie-cutter response that it isn't on the Interoperability Matrix Tool as a supported version (it wasn't when we were on 9.1 either).
A third party we have contact with has sent me a link to details about fastpathing being removed, but I don't think we were using it, so that may be another false lead.
The mount options were kept fairly straightforward:
nfs nolock,_netdev,udp 0 0
and we have also tried the same options as one of the production servers, which had tuned options; that server is on another cluster so isn't affected by this yet.
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
How would I be able to tell if we are using DNFS?
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
*From:* Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com] *Sent:* 23 January 2018 17:29 *To:* Fenn, Michael; Mark Saunders; toasters@teaparty.net
*Subject:* RE: NFS issue after upgrading filers to 9.2P2
It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.
I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.
*From:* Fenn, Michael [mailto:fennm@DEShawResearch.com] *Sent:* Tuesday, January 23, 2018 6:23 PM *To:* Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net *Subject:* Re: NFS issue after upgrading filers to 9.2P2
The messages are not necessarily indicative of a network problem.
The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs: server … OK".
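A back-of-the-envelope sketch of when that message appears (the timeo and retrans values below are hypothetical illustrations; actual defaults vary by kernel and transport, and UDP mounts roughly double the minor timeout on each retry):

```shell
# Hypothetical UDP-style values: timeo=7 deciseconds (0.7 s), retrans=3.
# The minor timeout doubles after each retry; "not responding, still trying"
# is logged at the major timeout, once all retries are exhausted.
timeo=7
retrans=3
total=0
t=$timeo
i=0
while [ $i -le $retrans ]; do
  total=$((total + t))   # accumulate this attempt's wait
  t=$((t * 2))           # exponential back-off for the next attempt
  i=$((i + 1))
done
echo "major timeout after $((total / 10)).$((total % 10)) seconds"
# prints: major timeout after 10.5 seconds
```

With the `timeo=600` used on the tuned mounts, each attempt already waits a full minute, so a "not responding" message implies a much longer stall than the defaults would.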
Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
Thanks,
Michael
*From: *toasters-bounces@teaparty.net on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.com> *Date: *Tuesday, January 23, 2018 at 10:38 AM *To: *Mark Saunders <Mark.Saunders@pcmsgroup.com>, "toasters@teaparty.net" <toasters@teaparty.net> *Subject: *RE: NFS issue after upgrading filers to 9.2P2
Those messages are indicative of a network problem. The packets are going through, then they succeed when the NFS client retries, then they pause again.
I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of a locking, firewall, or general configuration problem you should have no access whatsoever.
I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.
What are the mount options used, and are you using DNFS?
*From:* toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] *On Behalf Of *Mark Saunders *Sent:* Tuesday, January 23, 2018 4:29 PM *To:* toasters@teaparty.net *Subject:* NFS issue after upgrading filers to 9.2P2
Hi gents, today we upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the SAP database servers' NFS mounts. When a server is bounced, the mounts attach with no problems, but after a few minutes a df -h becomes very slow reporting the NFS-mounted directories, and if the databases are started up they hang and a df -h then also hangs. This sometimes recovers enough to allow a df -h to work again, but the databases are a lost cause right now.
In the server messages we get lots of entries for the SVM:
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:01:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:07 jwukccsbci last message repeated 5 times
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Is there anything that would have changed in the upgrade to lock down NFS, or any options that changed that we might need to change back?
The Red Hat servers are on an old kernel, 2.6.18-371.el5, which has some bugs, but this was working fine before the filer upgrade was carried out.
Regards
Mark
Data Centre Sysadmin Team
Managed Services
Phone:- 02476 694455 Ext 2567
The Sysadmin Team promoting PCMS Values ~Integrity~Respect~Commitment~ ~Continuous Improvement~
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters