After a bit of searching through email, it was this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=321111
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 12:07
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
If that's 441463, I'm skeptical that's the problem. That might cause problems during boot, but I wouldn’t expect it to cause problems later. Also, an ONTAP upgrade shouldn't affect this.
I'll subscribe to the case and follow along. The stats below do show some possible problems. There was some flow control activity, and the SACK numbers look high to me.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 1:02 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Parisi, Justin <Justin.Parisi@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>;
toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
I will try to find the kernel bug number, as I can't see it in the documentation for the server; there is just the following note.
RHEL 5.11 has a bug where NFS mounts mounted after network initialization at boot run with an increased number of TCP requests (approx. 10x more), which causes an RPC backlog and restricts network throughput on the NFS mounts.
To resolve this, a script has been created to restart the networking before the NFS mounts are mounted by netfs at boot. By default netfs runs at S25 on runlevels 3, 4 and 5, so we will set the NFS fix to run at S24 on the same runlevels.
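As a sketch, a SysV init script of that shape might look like the following (the script name and chkconfig priorities here are illustrative, not the site's actual fix):

```shell
#!/bin/sh
# /etc/init.d/nfs-netfix -- illustrative sketch, not the actual script used.
# chkconfig: 345 24 76
# description: Restart networking just before netfs (S25) mounts NFS shares,
#              working around the RHEL 5.11 boot-time NFS/TCP bug noted above.

case "$1" in
  start)
    # Bounce the network stack so the NFS mounts made by netfs start clean.
    /sbin/service network restart
    ;;
  stop|status)
    # Nothing to do at stop/status.
    ;;
  *)
    echo "Usage: $0 {start|stop|status}"
    exit 1
    ;;
esac
exit 0
```

Registered with `chkconfig --add nfs-netfix`, the `chkconfig: 345 24 76` header places it at S24 on runlevels 3, 4 and 5, one slot ahead of netfs at S25.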
PGUKCSTGCL01::*> node run -node PGUKCSTGCL01-01 -command netstat -sp tcp
---- Default IPSpace ----
tcp:
900103907 packets sent
476280230 data packets (4676048494764 bytes)
61984 data packets (82328048 bytes) retransmitted
2065 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
235945463 ack-only packets (517654 delayed)
0 URG only packets
0 window probe packets
187429557 window update packets
333130 control packets
1097649475 packets received
399065895 acks (for 4676054895668 bytes)
2174268 duplicate acks
0 acks for unsent data
723809875 packets (4886339861169 bytes) received in-sequence
1649638 completely duplicate packets (98637034 bytes)
2 old duplicate packets
990 packets with some dup. data (214519 bytes duped)
10872239 out-of-order packets (15192422547 bytes)
0 packets (0 bytes) of data after window
0 window probes
26845 window update packets
2 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
37581 discarded due to memory problems
1441 connection requests
412966 connection accepts
0 bad connection attempts
0 listen queue overflows
305109 ignored RSTs in the windows
414403 connections established (including accepts)
443890 connections closed (including 139609 drops)
151376 connections updated cached RTT on close
151388 connections updated cached RTT variance on close
140203 connections updated cached ssthresh on close
0 embryonic connections dropped
388403781 segments updated rtt (of 258539924 attempts)
6843 retransmit timeouts
11 connections dropped by rexmit timeout
3 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
92323 keepalive timeouts
92323 keepalive probes sent
0 connections dropped by keepalive
351415606 correct ACK header predictions
684179955 correct data packet header predictions
412966 syncache entries added
155 retransmitted
302 dupsyn
0 dropped
412966 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
412966 cookies sent
0 cookies received
112 hostcache entries added
0 bucket overflow
16181 SACK recovery episodes
51541 segment rexmits in SACK recovery episodes
70735551 byte rexmits in SACK recovery episodes
277116 SACK options (SACK blocks) received
11457931 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
251543 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
251543 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
4 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
79 times the receive window was closed
44 dropped due to flowcontrol
188382441 segments sent using TSO
4595103991390 bytes sent using TSO
73883767 TSO segments truncated
1069 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
366670238 recv upcalls batched in HP
302647105 recv upcalls made in HP
296877004 recv upcalls made in HP because of PSH
2291336 recv upcalls made in HP because of sb_hiwat
3481239 recv upcalls made in HP because of both PSH and sb_hiwat
6733214 recv upcall batch timeouts
16594187 times recv upcall read partial sb_cc in HP
631681762 segments received using LRO
4816721023400 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ANYVSERVER IPSpace ----
tcp:
0 packets sent
0 data packets (0 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
0 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
0 window update packets
0 control packets
0 packets received
0 acks (for 0 bytes)
0 duplicate acks
0 acks for unsent data
0 packets (0 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
0 connection requests
0 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
0 connections established (including accepts)
7 connections closed (including 0 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
0 segments updated rtt (of 0 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
0 correct ACK header predictions
0 correct data packet header predictions
0 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
0 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
0 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- Cluster IPSpace ----
tcp:
350960787 packets sent
253625385 data packets (2042642509989 bytes)
11525 data packets (120517203 bytes) retransmitted
63 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
38550609 ack-only packets (15348627 delayed)
0 URG only packets
1 window probe packet
56728197 window update packets
2035396 control packets
341097715 packets received
224460892 acks (for 2042726150883 bytes)
6840725 duplicate acks
0 acks for unsent data
271870811 packets (3031038679110 bytes) received in-sequence
195650 completely duplicate packets (4506 bytes)
49 old duplicate packets
0 packets with some dup. data (0 bytes duped)
205398 out-of-order packets (565766073 bytes)
0 packets (0 bytes) of data after window
0 window probes
2011210 window update packets
123 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
923539 connection requests
456892 connection accepts
0 bad connection attempts
0 listen queue overflows
529 ignored RSTs in the windows
1271558 connections established (including accepts)
1379180 connections closed (including 1101 drops)
369895 connections updated cached RTT on close
370750 connections updated cached RTT variance on close
12122 connections updated cached ssthresh on close
108207 embryonic connections dropped
224454663 segments updated rtt (of 207849890 attempts)
48471 retransmit timeouts
14 connections dropped by rexmit timeout
1 persist timeout
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
152128 keepalive timeouts
147328 keepalive probes sent
4800 connections dropped by keepalive
45057764 correct ACK header predictions
104981779 correct data packet header predictions
457057 syncache entries added
61 retransmitted
0 dupsyn
0 dropped
456892 completed
0 bucket overflow
0 cache overflow
165 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
457057 cookies sent
0 cookies received
61 hostcache entries added
0 bucket overflow
1684 SACK recovery episodes
2491 segment rexmits in SACK recovery episodes
5618157 byte rexmits in SACK recovery episodes
17518 SACK options (SACK blocks) received
86946 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
56607835 segments sent using TSO
1679494142753 bytes sent using TSO
36473474 TSO segments truncated
394 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
4879278 recv upcalls batched in HP
90401291 recv upcalls made in HP
90401967 recv upcalls made in HP because of PSH
52 recv upcalls made in HP because of sb_hiwat
325 recv upcalls made in HP because of both PSH and sb_hiwat
32882 recv upcall batch timeouts
524 times recv upcall read partial sb_cc in HP
160827213 segments received using LRO
2789346524807 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ips_4294967289 IPSpace ----
tcp:
0 packets sent
0 data packets (0 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
0 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
0 window update packets
0 control packets
0 packets received
0 acks (for 0 bytes)
0 duplicate acks
0 acks for unsent data
0 packets (0 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
0 connection requests
0 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
0 connections established (including accepts)
0 connections closed (including 0 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
0 segments updated rtt (of 0 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
0 correct ACK header predictions
0 correct data packet header predictions
0 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
0 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
0 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
---- ACP IPSpace ----
tcp:
86643 packets sent
17496 data packets (419904 bytes)
0 data packets (0 bytes) retransmitted
0 data packets unnecessarily retransmitted
0 resends initiated by MTU discovery
33848 ack-only packets (0 delayed)
0 URG only packets
0 window probe packets
23 window update packets
35276 control packets
74406 packets received
51152 acks (for 436064 bytes)
4798 duplicate acks
0 acks for unsent data
20938 packets (1251746 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
1686 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 discarded due to memory problems
17605 connection requests
176 connection accepts
0 bad connection attempts
0 listen queue overflows
0 ignored RSTs in the windows
17672 connections established (including accepts)
17781 connections closed (including 2 drops)
0 connections updated cached RTT on close
0 connections updated cached RTT variance on close
0 connections updated cached ssthresh on close
0 embryonic connections dropped
51152 segments updated rtt (of 52750 attempts)
109 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
17474 correct ACK header predictions
4954 correct data packet header predictions
176 syncache entries added
0 retransmitted
0 dupsyn
0 dropped
176 completed
0 bucket overflow
0 cache overflow
0 reset
0 stale
0 aborted
0 badack
0 unreach
0 zone failures
176 cookies sent
0 cookies received
0 hostcache entries added
0 bucket overflow
0 SACK recovery episodes
0 segment rexmits in SACK recovery episodes
0 byte rexmits in SACK recovery episodes
0 SACK options (SACK blocks) received
0 SACK options (SACK blocks) sent
0 SACK scoreboard overflow
0 packets with ECN CE bit set
0 packets with ECN ECT(0) bit set
0 packets with ECN ECT(1) bit set
0 successful ECN handshakes
0 times ECN reduced the congestion window
0 times in ONTAP flow control
0 times exited ONTAP flow control
0 times in ONTAP flow control for zero send window
0 times in ONTAP flow control for non-zero send window
0 connection resets due to ONTAP extreme flow control
0 times in ONTAP extreme flow control
0 is the maximum flow control reset threshold reached during receive
0 is the maximum flow control reset threshold reached during send
0 bytes is send buffer value during last reset
0 bytes is send buffer hiwat mark during last reset
0 times the receive window was closed
0 dropped due to flowcontrol
0 segments sent using TSO
0 bytes sent using TSO
0 TSO segments truncated
0 TSO wrapped sequence space segments
0 segments sent using TSO6
0 bytes sent using TSO6
0 TSO6 segments truncated
0 TSO6 wrapped sequence space segments
0 recv upcalls batched in HP
0 recv upcalls made in HP
0 recv upcalls made in HP because of PSH
0 recv upcalls made in HP because of sb_hiwat
0 recv upcalls made in HP because of both PSH and sb_hiwat
0 recv upcall batch timeouts
0 times recv upcall read partial sb_cc in HP
0 segments received using LRO
0 bytes received using LRO
0 segments received using LRO6
0 bytes received using LRO6
Server tcp entries
[root@jwukccsbci ~]# sysctl -a | grep slot
sunrpc.tcp_slot_table_entries = 128
sunrpc.udp_slot_table_entries = 128
dev.cdrom.info = drive # of slots: 1
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 11:53
To: Mark Saunders; Parisi, Justin; Fenn, Michael;
toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.
Complete details are in TR-3633, but these are the two that you want to watch:
[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot
sunrpc.tcp_max_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128
Newer versions of Linux will allow a ridiculous number of unacknowledged RPC operations to build up. The result can be sending ONTAP into a flow control mode until the OS catches up. We see problems mostly with slow clients.
For example, if you're trying to read a lot of data from a high-end ONTAP system on a host with 1Gb connectivity, the OS can ask for data more quickly than it can process the responses.
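For reference, capping the slot tables on a client usually looks something like the following configuration fragment (a sketch along the lines of TR-3633's general advice; exact parameter availability varies by kernel, and older kernels such as RHEL 5's lack tcp_max_slot_table_entries):

```shell
# Persist the cap across reboots: sunrpc module options are read at load time.
cat > /etc/modprobe.d/sunrpc.conf <<'EOF'
options sunrpc tcp_slot_table_entries=128
options sunrpc tcp_max_slot_table_entries=128
EOF

# Apply immediately on a running system where these sysctls exist.
sysctl -w sunrpc.tcp_slot_table_entries=128
sysctl -w sunrpc.tcp_max_slot_table_entries=128
```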
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 12:26 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>;
toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
Justin
I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no
issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.
I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run "statistics start
-object nfs_exports_access_cache", which when checked doesn't report any errors.
On the server interface
eth1 Link encap:Ethernet HWaddr 00:50:56:A5:0D:6A
inet addr:10.240.1.30 Bcast:10.240.1.31 Mask:255.255.255.224
inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:127209 errors:0 dropped:0 overruns:0 frame:0
TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:104158360 (99.3 MiB) TX bytes:14489402 (13.8 MiB)
While investigating we have found that the file systems are fine just after a reboot, and you can ls each mount, so they are initially all OK. It is when starting the application, and so putting a bigger load over the
network, that the file systems stop responding.
Regards
Mark
From: Parisi, Justin [mailto:Justin.Parisi@netapp.com]
Sent: 23 January 2018 22:33
To: Steiner, Jeffrey; Mark Saunders; Fenn, Michael;
toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?
“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”
From: Parisi, Justin
Sent: Tuesday, January 23, 2018 5:30 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>;
Fenn, Michael <fennm@DEShawResearch.com>;
toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
This community post also does a good job explaining it:
From:
toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net]
On Behalf Of Parisi, Justin
Sent: Tuesday, January 23, 2018 5:28 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>;
toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.
https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-9EDC-868C5292381B.html
The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet
is being dropped.
From:
toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net]
On Behalf Of Steiner, Jeffrey
Sent: Tuesday, January 23, 2018 5:24 PM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>;
toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
I should have asked - is this SAP HANA or something like SAP on an Oracle database?
Also, what do they mean "it's not on the IMT"? Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There are a tiny number of exceptions, but generally speaking we'll
support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.
The thing about fastpath does ring a few bells.
From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Tuesday, January 23, 2018 11:18 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>;
toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
Thanks for the quick replies; sorry for the delay in responding, but I had been working on this since 5am so had to go sleep.
I have a call open with NetApp but have had the cookie-cutter response that it isn't listed in the Interoperability Matrix Tool as a supported version (it wasn't when we were on 9.1 anyway).
A third party we have contact with has sent me a link to details about fastpath being removed, but I don't think we were using it, so maybe that's another false line to look down.
The mount options were kept fairly straightforward:
nfs nolock,_netdev,udp 0 0
and we have also tried the same options as one of the production servers, which had tuned options; this is on another cluster so isn't affected by this yet.
nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
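For context, a complete fstab entry using those tuned options would look something like this (the export path and mount point below are hypothetical placeholders, not taken from the thread):

```
# /etc/fstab -- hypothetical example line; export and mount point are invented
JWUKCSVM01:/vol/sapdata  /sapdata  nfs  nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600  0 0
```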
How would I be able to tell if we are using DNFS?
I will send you the support details tomorrow when I am back in the office.
Regards
Mark
From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 23 January 2018 17:29
To: Fenn, Michael; Mark Saunders; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2
It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.
I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.
From: Fenn, Michael [mailto:fennm@DEShawResearch.com]
Sent: Tuesday, January 23, 2018 6:23 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>;
toasters@teaparty.net
Subject: Re: NFS issue after upgrading filers to 9.2P2
The messages are not necessarily indicative of a network problem.
The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries. Once the server responds, then it prints "nfs:
server … OK".
Networking problems are certainly one reason that an operation would time out, but not the only reason. An overloaded or down file server will cause the same effect.
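As a rough sketch of the arithmetic (timeo=600 comes from the tuned mount options quoted earlier in the thread; retrans=2 is a common default here, not a confirmed setting):

```shell
timeo=600    # retry timeout in tenths of a second, from the tuned mount options
retrans=2    # retries before "not responding" is logged; a common default, assumed

# timeo is in deciseconds, so each retry interval is timeo/10 seconds.
retry_interval_s=$((timeo / 10))
echo "each retry waits ${retry_interval_s}s; 'not responding' is logged after ${retrans} timed-out retries"
```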
Thanks,
Michael
From:
<toasters-bounces@teaparty.net> on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.com>
Date: Tuesday, January 23, 2018 at 10:38 AM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>, "toasters@teaparty.net" <toasters@teaparty.net>
Subject: RE: NFS issue after upgrading filers to 9.2P2
I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of locking, firewall, or general configuration problem, you would have
no access whatsoever.
I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.
What are the mount options used, and are you using DNFS?
From:
toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net]
On Behalf Of Mark Saunders
Sent: Tuesday, January 23, 2018 4:29 PM
To: toasters@teaparty.net
Subject: NFS issue after upgrading filers to 9.2P2
Hi gents, today we have upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the SAP database servers' NFS mounts. When a server is bounced the mounts are attached with no problems,
but after a few minutes a df -h starts to be very slow reporting the NFS-mounted directories, and if the databases are started up they hang, and a df -h then also hangs. This sometimes recovers enough to allow a df -h to work again, but the databases are
a lost cause right now.
In the server messages we get lots of entries for the SVM:
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:01:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:07 jwukccsbci last message repeated 5 times
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:47 jwukccsbci last message repeated 5 times
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Is there anything that would have changed in the upgrade to lock down NFS, or changed options that we might need to change back?
The Red Hat servers are on an old kernel version, 2.6.18-371.el5, which has some bugs, but this was working fine before the filer upgrade was carried out.
Regards
Mark
Data Centre Sysadmin Team
Managed Services
Phone:- 02476 694455 Ext 2567