After a bit of an email search, it turned out to be this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=321111

From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 12:07
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

If that's 441463, I'm skeptical that's the problem. That might cause problems during boot, but I wouldn’t expect it to cause problems later. Also, an ONTAP upgrade shouldn't affect this.

I'll subscribe to the case and follow along. The stats below do show some possible problems. There was some flow control activity, and the SACK numbers look high to me.
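To put the counters mentioned above in proportion, here is a minimal Python sketch that turns a few of them into ratios. The numbers are copied from the "Default IPSpace" section of the netstat output quoted below; the dictionary keys and the function names `ratio` and `summarize` are made up for illustration, and which ratio counts as "high" is a judgment call that this sketch does not settle.

```python
# Counters copied from the "Default IPSpace" section of the netstat output
# quoted below. The key names are hypothetical labels, not ONTAP field names.
DEFAULT_IPSPACE = {
    "data_packets_sent": 476280230,
    "data_packets_retransmitted": 61984,
    "packets_in_sequence": 723809875,
    "packets_out_of_order": 10872239,
    "sack_blocks_received": 277116,
    "sack_blocks_sent": 11457931,
}


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(counters):
    """Express a few raw counters as percentages or simple ratios."""
    return {
        # Retransmitted data packets as a percentage of data packets sent.
        "retransmit_pct": 100 * ratio(
            counters["data_packets_retransmitted"], counters["data_packets_sent"]
        ),
        # Out-of-order packets per 100 packets received in sequence.
        "out_of_order_pct": 100 * ratio(
            counters["packets_out_of_order"], counters["packets_in_sequence"]
        ),
        # SACK blocks sent by the filer for each SACK block it received.
        "sack_sent_per_received": ratio(
            counters["sack_blocks_sent"], counters["sack_blocks_received"]
        ),
    }


if __name__ == "__main__":
    for name, value in summarize(DEFAULT_IPSPACE).items():
        print(f"{name}: {value:.3f}")
```

The same function can be fed the counters from the Cluster IPSpace section for comparison.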

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 1:02 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Parisi, Justin <Justin.Parisi@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

I will try to find the kernel bug number, as I can't see it in the documentation for the server; there is just the following note.

RHEL 5.11 has a bug where NFS mounts mounted after network initialization at boot run with an increased number of TCP requests (approx. 10x more), which causes an RPC backlog and restricts network throughput on the NFS mounts.

To resolve this, a script has been created to restart networking before the NFS mounts are mounted by netfs at boot. By default netfs runs at boot as S25 on runlevels 3, 4 and 5, so we will set the NFS fix to run as S24 on the same runlevels.

PGUKCSTGCL01::*> node run -node PGUKCSTGCL01-01 -command netstat -sp tcp

---- Default IPSpace ----

tcp:

900103907 packets sent

476280230 data packets (4676048494764 bytes)

61984 data packets (82328048 bytes) retransmitted

2065 data packets unnecessarily retransmitted

0 resends initiated by MTU discovery

235945463 ack-only packets (517654 delayed)

0 URG only packets

0 window probe packets

187429557 window update packets

333130 control packets

1097649475 packets received

399065895 acks (for 4676054895668 bytes)

2174268 duplicate acks

0 acks for unsent data

723809875 packets (4886339861169 bytes) received in-sequence

1649638 completely duplicate packets (98637034 bytes)

2 old duplicate packets

990 packets with some dup. data (214519 bytes duped)

10872239 out-of-order packets (15192422547 bytes)

0 packets (0 bytes) of data after window

0 window probes

26845 window update packets

2 packets received after close

0 discarded for bad checksums

0 discarded for bad header offset fields

0 discarded because packet too short

37581 discarded due to memory problems

1441 connection requests

412966 connection accepts

0 bad connection attempts

0 listen queue overflows

305109 ignored RSTs in the windows

414403 connections established (including accepts)

443890 connections closed (including 139609 drops)

151376 connections updated cached RTT on close

151388 connections updated cached RTT variance on close

140203 connections updated cached ssthresh on close

0 embryonic connections dropped

388403781 segments updated rtt (of 258539924 attempts)

6843 retransmit timeouts

11 connections dropped by rexmit timeout

3 persist timeouts

0 connections dropped by persist timeout

0 Connections (fin_wait_2) dropped because of timeout

92323 keepalive timeouts

92323 keepalive probes sent

0 connections dropped by keepalive

351415606 correct ACK header predictions

684179955 correct data packet header predictions

412966 syncache entries added

155 retransmitted

302 dupsyn

0 dropped

412966 completed

0 bucket overflow

0 cache overflow

0 reset

0 stale

0 aborted

0 badack

0 unreach

0 zone failures

412966 cookies sent

0 cookies received

112 hostcache entries added

0 bucket overflow

16181 SACK recovery episodes

51541 segment rexmits in SACK recovery episodes

70735551 byte rexmits in SACK recovery episodes

277116 SACK options (SACK blocks) received

11457931 SACK options (SACK blocks) sent

0 SACK scoreboard overflow

0 packets with ECN CE bit set

0 packets with ECN ECT(0) bit set

0 packets with ECN ECT(1) bit set

0 successful ECN handshakes

0 times ECN reduced the congestion window

251543 times in ONTAP flow control

0 times exited ONTAP flow control

0 times in ONTAP flow control for zero send window

251543 times in ONTAP flow control for non-zero send window

0 connection resets due to ONTAP extreme flow control

0 times in ONTAP extreme flow control

0 is the maximum flow control reset threshold reached during receive

4 is the maximum flow control reset threshold reached during send

0 bytes is send buffer value during last reset

0 bytes is send buffer hiwat mark during last reset

79 times the receive window was closed

44 dropped due to flowcontrol

188382441 segments sent using TSO

4595103991390 bytes sent using TSO

73883767 TSO segments truncated

1069 TSO wrapped sequence space segments

0 segments sent using TSO6

0 bytes sent using TSO6

0 TSO6 segments truncated

0 TSO6 wrapped sequence space segments

366670238 recv upcalls batched in HP

302647105 recv upcalls made in HP

296877004 recv upcalls made in HP because of PSH

2291336 recv upcalls made in HP because of sb_hiwat

3481239 recv upcalls made in HP because of both PSH and sb_hiwat

6733214 recv upcall batch timeouts

16594187 times recv upcall read partial sb_cc in HP

631681762 segments received using LRO

4816721023400 bytes received using LRO

0 segments received using LRO6

0 bytes received using LRO6

---- ANYVSERVER IPSpace ----

tcp:

0 packets sent

0 data packets (0 bytes)

0 data packets (0 bytes) retransmitted

0 data packets unnecessarily retransmitted

0 resends initiated by MTU discovery

0 ack-only packets (0 delayed)

0 URG only packets

0 window probe packets

0 window update packets

0 control packets

0 packets received

0 acks (for 0 bytes)

0 duplicate acks

0 acks for unsent data

0 packets (0 bytes) received in-sequence

0 completely duplicate packets (0 bytes)

0 old duplicate packets

0 packets with some dup. data (0 bytes duped)

0 out-of-order packets (0 bytes)

0 packets (0 bytes) of data after window

0 window probes

0 window update packets

0 packets received after close

0 discarded for bad checksums

0 discarded for bad header offset fields

0 discarded because packet too short

0 discarded due to memory problems

0 connection requests

0 connection accepts

0 bad connection attempts

0 listen queue overflows

0 ignored RSTs in the windows

0 connections established (including accepts)

7 connections closed (including 0 drops)

0 connections updated cached RTT on close

0 connections updated cached RTT variance on close

0 connections updated cached ssthresh on close

0 embryonic connections dropped

0 segments updated rtt (of 0 attempts)

0 retransmit timeouts

0 connections dropped by rexmit timeout

0 persist timeouts

0 connections dropped by persist timeout

0 Connections (fin_wait_2) dropped because of timeout

0 keepalive timeouts

0 keepalive probes sent

0 connections dropped by keepalive

0 correct ACK header predictions

0 correct data packet header predictions

0 syncache entries added

0 retransmitted

0 dupsyn

0 dropped

0 completed

0 bucket overflow

0 cache overflow

0 reset

0 stale

0 aborted

0 badack

0 unreach

0 zone failures

0 cookies sent

0 cookies received

0 hostcache entries added

0 bucket overflow

0 SACK recovery episodes

0 segment rexmits in SACK recovery episodes

0 byte rexmits in SACK recovery episodes

0 SACK options (SACK blocks) received

0 SACK options (SACK blocks) sent

0 SACK scoreboard overflow

0 packets with ECN CE bit set

0 packets with ECN ECT(0) bit set

0 packets with ECN ECT(1) bit set

0 successful ECN handshakes

0 times ECN reduced the congestion window

0 times in ONTAP flow control

0 times exited ONTAP flow control

0 times in ONTAP flow control for zero send window

0 times in ONTAP flow control for non-zero send window

0 connection resets due to ONTAP extreme flow control

0 times in ONTAP extreme flow control

0 is the maximum flow control reset threshold reached during receive

0 is the maximum flow control reset threshold reached during send

0 bytes is send buffer value during last reset

0 bytes is send buffer hiwat mark during last reset

0 times the receive window was closed

0 dropped due to flowcontrol

0 segments sent using TSO

0 bytes sent using TSO

0 TSO segments truncated

0 TSO wrapped sequence space segments

0 segments sent using TSO6

0 bytes sent using TSO6

0 TSO6 segments truncated

0 TSO6 wrapped sequence space segments

0 recv upcalls batched in HP

0 recv upcalls made in HP

0 recv upcalls made in HP because of PSH

0 recv upcalls made in HP because of sb_hiwat

0 recv upcalls made in HP because of both PSH and sb_hiwat

0 recv upcall batch timeouts

0 times recv upcall read partial sb_cc in HP

0 segments received using LRO

0 bytes received using LRO

0 segments received using LRO6

0 bytes received using LRO6

---- Cluster IPSpace ----

tcp:

350960787 packets sent

253625385 data packets (2042642509989 bytes)

11525 data packets (120517203 bytes) retransmitted

63 data packets unnecessarily retransmitted

0 resends initiated by MTU discovery

38550609 ack-only packets (15348627 delayed)

0 URG only packets

1 window probe packet

56728197 window update packets

2035396 control packets

341097715 packets received

224460892 acks (for 2042726150883 bytes)

6840725 duplicate acks

0 acks for unsent data

271870811 packets (3031038679110 bytes) received in-sequence

195650 completely duplicate packets (4506 bytes)

49 old duplicate packets

0 packets with some dup. data (0 bytes duped)

205398 out-of-order packets (565766073 bytes)

0 packets (0 bytes) of data after window

0 window probes

2011210 window update packets

123 packets received after close

0 discarded for bad checksums

0 discarded for bad header offset fields

0 discarded because packet too short

0 discarded due to memory problems

923539 connection requests

456892 connection accepts

0 bad connection attempts

0 listen queue overflows

529 ignored RSTs in the windows

1271558 connections established (including accepts)

1379180 connections closed (including 1101 drops)

369895 connections updated cached RTT on close

370750 connections updated cached RTT variance on close

12122 connections updated cached ssthresh on close

108207 embryonic connections dropped

224454663 segments updated rtt (of 207849890 attempts)

48471 retransmit timeouts

14 connections dropped by rexmit timeout

1 persist timeout

0 connections dropped by persist timeout

0 Connections (fin_wait_2) dropped because of timeout

152128 keepalive timeouts

147328 keepalive probes sent

4800 connections dropped by keepalive

45057764 correct ACK header predictions

104981779 correct data packet header predictions

457057 syncache entries added

61 retransmitted

0 dupsyn

0 dropped

456892 completed

0 bucket overflow

0 cache overflow

165 reset

0 stale

0 aborted

0 badack

0 unreach

0 zone failures

457057 cookies sent

0 cookies received

61 hostcache entries added

0 bucket overflow

1684 SACK recovery episodes

2491 segment rexmits in SACK recovery episodes

5618157 byte rexmits in SACK recovery episodes

17518 SACK options (SACK blocks) received

86946 SACK options (SACK blocks) sent

0 SACK scoreboard overflow

0 packets with ECN CE bit set

0 packets with ECN ECT(0) bit set

0 packets with ECN ECT(1) bit set

0 successful ECN handshakes

0 times ECN reduced the congestion window

0 times in ONTAP flow control

0 times exited ONTAP flow control

0 times in ONTAP flow control for zero send window

0 times in ONTAP flow control for non-zero send window

0 connection resets due to ONTAP extreme flow control

0 times in ONTAP extreme flow control

0 is the maximum flow control reset threshold reached during receive

0 is the maximum flow control reset threshold reached during send

0 bytes is send buffer value during last reset

0 bytes is send buffer hiwat mark during last reset

0 times the receive window was closed

0 dropped due to flowcontrol

56607835 segments sent using TSO

1679494142753 bytes sent using TSO

36473474 TSO segments truncated

394 TSO wrapped sequence space segments

0 segments sent using TSO6

0 bytes sent using TSO6

0 TSO6 segments truncated

0 TSO6 wrapped sequence space segments

4879278 recv upcalls batched in HP

90401291 recv upcalls made in HP

90401967 recv upcalls made in HP because of PSH

52 recv upcalls made in HP because of sb_hiwat

325 recv upcalls made in HP because of both PSH and sb_hiwat

32882 recv upcall batch timeouts

524 times recv upcall read partial sb_cc in HP

160827213 segments received using LRO

2789346524807 bytes received using LRO

0 segments received using LRO6

0 bytes received using LRO6

---- ips_4294967289 IPSpace ----

tcp:

0 packets sent

0 data packets (0 bytes)

0 data packets (0 bytes) retransmitted

0 data packets unnecessarily retransmitted

0 resends initiated by MTU discovery

0 ack-only packets (0 delayed)

0 URG only packets

0 window probe packets

0 window update packets

0 control packets

0 packets received

0 acks (for 0 bytes)

0 duplicate acks

0 acks for unsent data

0 packets (0 bytes) received in-sequence

0 completely duplicate packets (0 bytes)

0 old duplicate packets

0 packets with some dup. data (0 bytes duped)

0 out-of-order packets (0 bytes)

0 packets (0 bytes) of data after window

0 window probes

0 window update packets

0 packets received after close

0 discarded for bad checksums

0 discarded for bad header offset fields

0 discarded because packet too short

0 discarded due to memory problems

0 connection requests

0 connection accepts

0 bad connection attempts

0 listen queue overflows

0 ignored RSTs in the windows

0 connections established (including accepts)

0 connections closed (including 0 drops)

0 connections updated cached RTT on close

0 connections updated cached RTT variance on close

0 connections updated cached ssthresh on close

0 embryonic connections dropped

0 segments updated rtt (of 0 attempts)

0 retransmit timeouts

0 connections dropped by rexmit timeout

0 persist timeouts

0 connections dropped by persist timeout

0 Connections (fin_wait_2) dropped because of timeout

0 keepalive timeouts

0 keepalive probes sent

0 connections dropped by keepalive

0 correct ACK header predictions

0 correct data packet header predictions

0 syncache entries added

0 retransmitted

0 dupsyn

0 dropped

0 completed

0 bucket overflow

0 cache overflow

0 reset

0 stale

0 aborted

0 badack

0 unreach

0 zone failures

0 cookies sent

0 cookies received

0 hostcache entries added

0 bucket overflow

0 SACK recovery episodes

0 segment rexmits in SACK recovery episodes

0 byte rexmits in SACK recovery episodes

0 SACK options (SACK blocks) received

0 SACK options (SACK blocks) sent

0 SACK scoreboard overflow

0 packets with ECN CE bit set

0 packets with ECN ECT(0) bit set

0 packets with ECN ECT(1) bit set

0 successful ECN handshakes

0 times ECN reduced the congestion window

0 times in ONTAP flow control

0 times exited ONTAP flow control

0 times in ONTAP flow control for zero send window

0 times in ONTAP flow control for non-zero send window

0 connection resets due to ONTAP extreme flow control

0 times in ONTAP extreme flow control

0 is the maximum flow control reset threshold reached during receive

0 is the maximum flow control reset threshold reached during send

0 bytes is send buffer value during last reset

0 bytes is send buffer hiwat mark during last reset

0 times the receive window was closed

0 dropped due to flowcontrol

0 segments sent using TSO

0 bytes sent using TSO

0 TSO segments truncated

0 TSO wrapped sequence space segments

0 segments sent using TSO6

0 bytes sent using TSO6

0 TSO6 segments truncated

0 TSO6 wrapped sequence space segments

0 recv upcalls batched in HP

0 recv upcalls made in HP

0 recv upcalls made in HP because of PSH

0 recv upcalls made in HP because of sb_hiwat

0 recv upcalls made in HP because of both PSH and sb_hiwat

0 recv upcall batch timeouts

0 times recv upcall read partial sb_cc in HP

0 segments received using LRO

0 bytes received using LRO

0 segments received using LRO6

0 bytes received using LRO6

---- ACP IPSpace ----

tcp:

86643 packets sent

17496 data packets (419904 bytes)

0 data packets (0 bytes) retransmitted

0 data packets unnecessarily retransmitted

0 resends initiated by MTU discovery

33848 ack-only packets (0 delayed)

0 URG only packets

0 window probe packets

23 window update packets

35276 control packets

74406 packets received

51152 acks (for 436064 bytes)

4798 duplicate acks

0 acks for unsent data

20938 packets (1251746 bytes) received in-sequence

0 completely duplicate packets (0 bytes)

0 old duplicate packets

0 packets with some dup. data (0 bytes duped)

0 out-of-order packets (0 bytes)

0 packets (0 bytes) of data after window

0 window probes

0 window update packets

1686 packets received after close

0 discarded for bad checksums

0 discarded for bad header offset fields

0 discarded because packet too short

0 discarded due to memory problems

17605 connection requests

176 connection accepts

0 bad connection attempts

0 listen queue overflows

0 ignored RSTs in the windows

17672 connections established (including accepts)

17781 connections closed (including 2 drops)

0 connections updated cached RTT on close

0 connections updated cached RTT variance on close

0 connections updated cached ssthresh on close

0 embryonic connections dropped

51152 segments updated rtt (of 52750 attempts)

109 retransmit timeouts

0 connections dropped by rexmit timeout

0 persist timeouts

0 connections dropped by persist timeout

0 Connections (fin_wait_2) dropped because of timeout

0 keepalive timeouts

0 keepalive probes sent

0 connections dropped by keepalive

17474 correct ACK header predictions

4954 correct data packet header predictions

176 syncache entries added

0 retransmitted

0 dupsyn

0 dropped

176 completed

0 bucket overflow

0 cache overflow

0 reset

0 stale

0 aborted

0 badack

0 unreach

0 zone failures

176 cookies sent

0 cookies received

0 hostcache entries added

0 bucket overflow

0 SACK recovery episodes

0 segment rexmits in SACK recovery episodes

0 byte rexmits in SACK recovery episodes

0 SACK options (SACK blocks) received

0 SACK options (SACK blocks) sent

0 SACK scoreboard overflow

0 packets with ECN CE bit set

0 packets with ECN ECT(0) bit set

0 packets with ECN ECT(1) bit set

0 successful ECN handshakes

0 times ECN reduced the congestion window

0 times in ONTAP flow control

0 times exited ONTAP flow control

0 times in ONTAP flow control for zero send window

0 times in ONTAP flow control for non-zero send window

0 connection resets due to ONTAP extreme flow control

0 times in ONTAP extreme flow control

0 is the maximum flow control reset threshold reached during receive

0 is the maximum flow control reset threshold reached during send

0 bytes is send buffer value during last reset

0 bytes is send buffer hiwat mark during last reset

0 times the receive window was closed

0 dropped due to flowcontrol

0 segments sent using TSO

0 bytes sent using TSO

0 TSO segments truncated

0 TSO wrapped sequence space segments

0 segments sent using TSO6

0 bytes sent using TSO6

0 TSO6 segments truncated

0 TSO6 wrapped sequence space segments

0 recv upcalls batched in HP

0 recv upcalls made in HP

0 recv upcalls made in HP because of PSH

0 recv upcalls made in HP because of sb_hiwat

0 recv upcalls made in HP because of both PSH and sb_hiwat

0 recv upcall batch timeouts

0 times recv upcall read partial sb_cc in HP

0 segments received using LRO

0 bytes received using LRO

0 segments received using LRO6

0 bytes received using LRO6

Server tcp entries

[root@jwukccsbci ~]# sysctl -a | grep slot

sunrpc.tcp_slot_table_entries = 128

sunrpc.udp_slot_table_entries = 128

dev.cdrom.info = drive # of slots: 1

From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 11:53
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.

Complete details are in TR-3633, but these are the two that you want to watch:

[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot

sunrpc.tcp_max_slot_table_entries = 128

sunrpc.tcp_slot_table_entries = 128

Newer versions of Linux will allow a ridiculous number of unacknowledged RPC operations to build up. The result can be sending ONTAP into a flow control mode until the OS catches up. We see problems mostly with slow clients. For example, if you're trying to read a lot of data from a host with 1Gb connectivity on a high-end ONTAP system, the OS can ask for data quicker than it can process the responses.
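As a rough illustration of the check described above, the following Python sketch parses `sysctl -a | grep slot` style output and reports any TCP slot table setting above 128. The function names `parse_sysctl` and `slot_tables_over_limit` are hypothetical, the 128 limit is simply the cap suggested in this message, and the sample text is the server output posted elsewhere in the thread.

```python
import re

SLOT_LIMIT = 128  # cap suggested in the message above

# The two sunrpc keys named in the message above.
TCP_SLOT_KEYS = (
    "sunrpc.tcp_slot_table_entries",
    "sunrpc.tcp_max_slot_table_entries",
)


def parse_sysctl(text):
    """Parse 'key = value' lines into a dict, keeping only integer values."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([\w.]+)\s*=\s*(\d+)\s*$", line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


def slot_tables_over_limit(settings, limit=SLOT_LIMIT):
    """Return the TCP slot table keys whose value exceeds the limit."""
    return {
        key: settings[key]
        for key in TCP_SLOT_KEYS
        if key in settings and settings[key] > limit
    }


if __name__ == "__main__":
    sample = (
        "sunrpc.tcp_slot_table_entries = 128\n"
        "sunrpc.udp_slot_table_entries = 128\n"
        "dev.cdrom.info = drive # of slots: 1\n"
    )
    offenders = slot_tables_over_limit(parse_sysctl(sample))
    print(offenders or "TCP slot tables are within the limit")
```

A key that is absent (older kernels may not expose `sunrpc.tcp_max_slot_table_entries`) is simply skipped rather than treated as an error.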

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 12:26 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Justin

I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.

I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run “statistics start -object nfs_exports_access_cache”, which when checked doesn’t report any errors.

On the server interface

eth1 Link encap:Ethernet HWaddr 00:50:56:A5:0D:6A

inet addr:10.240.1.30 Bcast:10.240.1.31 Mask:255.255.255.224

inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

RX packets:127209 errors:0 dropped:0 overruns:0 frame:0

TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:104158360 (99.3 MiB) TX bytes:14489402 (13.8 MiB)

While investigating, we have found that the file systems are fine just after a reboot and you can ls each mount, so they are all OK initially. It is when we start the application, and so put a bigger load on the network, that the file systems stop responding.

Regards

Mark

From: Parisi, Justin [mailto:Justin.Parisi@netapp.com]
Sent: 23 January 2018 22:33
To: Steiner, Jeffrey; Mark Saunders; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?

“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”

From: Parisi, Justin
Sent: Tuesday, January 23, 2018 5:30 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

This community post also does a good job explaining it:

https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgrade-review-your-network-first/td-p/136657

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin
Sent: Tuesday, January 23, 2018 5:28 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.

https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-9EDC-868C5292381B.html

The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey
Sent: Tuesday, January 23, 2018 5:24 PM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

I should have asked - is this SAP HANA or something like SAP on an Oracle database?

Also, what do they mean by "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.

The thing about fastpath does ring a few bells.

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Tuesday, January 23, 2018 11:18 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

Thanks for the quick replies. Sorry for the delay in responding, but I had been working on this since 5am so had to go and sleep.

I have a call open with NetApp but have had the cookie-cutter response that it isn’t on the Interoperability Matrix Tool as a supported version (it wasn’t when on 9.1 anyway).

A third party we are in contact with has sent me a link to details about fastpath being removed, but I don’t think we were using it, so that may be another false lead.

The mount options were kept fairly straight forward as

nfs nolock,_netdev,udp 0 0

and we have also tried the same options as one of the production servers, which has tuned options; that server is on another cluster so isn’t affected by this yet.

nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
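For reading option strings like the two above, a small Python sketch can split the field into a dictionary and convert `timeo` into seconds. It assumes `timeo` is expressed in deciseconds, as noted elsewhere in this thread; the function names `parse_mount_options` and `timeo_seconds` are hypothetical.

```python
def parse_mount_options(option_string):
    """Split a comma-separated fstab option field into a dict.

    Options without a value (for example 'udp') map to True;
    options with a value keep it as a string.
    """
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


def timeo_seconds(options):
    """Convert timeo (assumed to be in tenths of a second) to seconds.

    Returns None when timeo is not set explicitly.
    """
    if "timeo" not in options:
        return None
    return int(options["timeo"]) / 10


if __name__ == "__main__":
    tuned = parse_mount_options(
        "nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600"
    )
    print("transport is UDP:", tuned.get("udp", False))
    print("timeo in seconds:", timeo_seconds(tuned))
```

Both option strings in this message select UDP as the transport, and only the second one sets `timeo` explicitly.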

How would I be able to tell if we are using DNFS?

I will send you the support details tomorrow when I am back in the office.

Regards

Mark

From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 23 January 2018 17:29
To: Fenn, Michael; Mark Saunders; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.

I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.

From: Fenn, Michael [mailto:fennm@DEShawResearch.com]
Sent: Tuesday, January 23, 2018 6:23 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net
Subject: Re: NFS issue after upgrading filers to 9.2P2

The messages are not necessarily indicative of a network problem.

The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs: server … OK".

Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.

Thanks,

Michael

From: <toasters-bounces@teaparty.net> on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.com>
Date: Tuesday, January 23, 2018 at 10:38 AM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>, "toasters@teaparty.net" <toasters@teaparty.net>
Subject: RE: NFS issue after upgrading filers to 9.2P2

Those messages are indicative of a network problem. The packets are going through, then they succeed when the NFS client retries, then they pause again.

I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of a locking, firewall, or general configuration problem you should have no access whatsoever.

I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.

What are the mount options used, and are you using DNFS?

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Mark Saunders
Sent: Tuesday, January 23, 2018 4:29 PM
To: toasters@teaparty.net
Subject: NFS issue after upgrading filers to 9.2P2

Hi gents, today we upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the NFS mounts on our SAP database servers. When a server is bounced, the mounts are attached with no problems, but after a few minutes a df -h becomes very slow to report the NFS-mounted directories, and if the databases are started up they hang and a df -h then also hangs. This sometimes recovers enough to allow a df -h to work again, but the databases are a lost cause right now.

In the server messages we get lots of entries for the SVM

Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying

Jan 23 07:01:47 jwukccsbci last message repeated 5 times

Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK

Jan 23 07:02:07 jwukccsbci last message repeated 5 times

Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying

Jan 23 07:02:47 jwukccsbci last message repeated 5 times

Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
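To measure how long each stall in the excerpt above lasted, a short Python sketch can pair every "not responding" line with the next "OK" line for the same server. The function name `stall_durations` and the regular expression are illustrative; the year is an assumption because syslog timestamps omit it, and month parsing with `%b` assumes an English locale. "last message repeated" lines are ignored, so this counts stall windows, not individual timeouts.

```python
import re
from datetime import datetime

LOG = """\
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:01:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:07 jwukccsbci last message repeated 5 times
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
"""

# Timestamp, host (ignored), then the NFS server name and its reported state.
PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: nfs: server (\S+) (not responding|OK)"
)


def stall_durations(log_text, year=2018):
    """Return (server, seconds) for each 'not responding' to 'OK' window."""
    pending = {}  # server name -> time of the first unanswered message
    stalls = []
    for line in log_text.splitlines():
        match = PATTERN.match(line)
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        server, state = match.group(2), match.group(3)
        if state == "not responding":
            pending.setdefault(server, stamp)
        elif server in pending:
            stalls.append((server, (stamp - pending.pop(server)).total_seconds()))
    return stalls


if __name__ == "__main__":
    for server, seconds in stall_durations(LOG):
        print(f"{server}: stalled for {seconds:.0f} s")
```

Run against a full /var/log/messages file, the same function would show whether the stalls line up with application start-up.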

Is there anything that would have changed in the upgrade to lock down NFS, or any changed options that we might need to change back?

The Red Hat servers run an old kernel version, 2.6.18-371.el5, that has some bugs, but this was working fine before the filer upgrade was carried out.

Regards

Mark

Data Centre Sysadmin Team

Managed Services

Phone:- 02476 694455 Ext 2567
