After a bit of an email search, it turned out to be this bug:

 

https://bugzilla.redhat.com/show_bug.cgi?id=321111

 

 

From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 12:07
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

If that's 441463, I'm skeptical that's the problem. That might cause problems during boot, but I wouldn’t expect it to cause problems later. Also, an ONTAP upgrade shouldn't affect this.

 

I'll subscribe to the case and follow along. The stats below do show some possible problems. There was some flow control activity, and the SACK numbers look high to me.

 

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 1:02 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Parisi, Justin <Justin.Parisi@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

I will try to find the kernel bug number, as I can't see it in the documentation for the server; there is just the following note.

 

RHEL 5.11 has a bug where NFS mounts mounted after network initialization at boot run with an increased number of TCP requests (approx 10x more) which causes rpc backlog and restricts network throughput on the NFS mounts.

To resolve this a script has been created to restart the networking before the NFS mounts are mounted by netfs at boot. By default netfs runs at boot s25 on runlevel 3, 4 and 5 so we will set the NFS fix to run at s24 on the same run levels.

 

PGUKCSTGCL01::*> node run -node PGUKCSTGCL01-01 -command netstat -sp tcp

---- Default IPSpace ----

tcp:

        900103907 packets sent

                476280230 data packets (4676048494764 bytes)

                61984 data packets (82328048 bytes) retransmitted

                2065 data packets unnecessarily retransmitted

                0 resends initiated by MTU discovery

                235945463 ack-only packets (517654 delayed)

                0 URG only packets

                0 window probe packets

                187429557 window update packets

                333130 control packets

        1097649475 packets received

                399065895 acks (for 4676054895668 bytes)

                2174268 duplicate acks

                0 acks for unsent data

                723809875 packets (4886339861169 bytes) received in-sequence

                1649638 completely duplicate packets (98637034 bytes)

                2 old duplicate packets

                990 packets with some dup. data (214519 bytes duped)

                10872239 out-of-order packets (15192422547 bytes)

                0 packets (0 bytes) of data after window

                0 window probes

                26845 window update packets

                2 packets received after close

                0 discarded for bad checksums

                0 discarded for bad header offset fields

                0 discarded because packet too short

                37581 discarded due to memory problems

        1441 connection requests

        412966 connection accepts

        0 bad connection attempts

        0 listen queue overflows

        305109 ignored RSTs in the windows

        414403 connections established (including accepts)

        443890 connections closed (including 139609 drops)

                151376 connections updated cached RTT on close

                151388 connections updated cached RTT variance on close

                140203 connections updated cached ssthresh on close

        0 embryonic connections dropped

        388403781 segments updated rtt (of 258539924 attempts)

        6843 retransmit timeouts

                11 connections dropped by rexmit timeout

        3 persist timeouts

                0 connections dropped by persist timeout

        0 Connections (fin_wait_2) dropped because of timeout

        92323 keepalive timeouts

                92323 keepalive probes sent

                0 connections dropped by keepalive

        351415606 correct ACK header predictions

        684179955 correct data packet header predictions

        412966 syncache entries added

                155 retransmitted

                302 dupsyn

                0 dropped

                412966 completed

                0 bucket overflow

                0 cache overflow

                0 reset

                0 stale

                0 aborted

                0 badack

                0 unreach

                0 zone failures

        412966 cookies sent

        0 cookies received

        112 hostcache entries added

                0 bucket overflow

        16181 SACK recovery episodes

        51541 segment rexmits in SACK recovery episodes

        70735551 byte rexmits in SACK recovery episodes

        277116 SACK options (SACK blocks) received

        11457931 SACK options (SACK blocks) sent

        0 SACK scoreboard overflow

        0 packets with ECN CE bit set

        0 packets with ECN ECT(0) bit set

        0 packets with ECN ECT(1) bit set

        0 successful ECN handshakes

        0 times ECN reduced the congestion window

        251543 times in ONTAP flow control

        0 times exited ONTAP flow control

        0 times in ONTAP flow control for zero send window

        251543 times in ONTAP flow control for non-zero send window

        0 connection resets due to ONTAP extreme flow control

        0 times in ONTAP extreme flow control

        0 is the maximum flow control reset threshold reached during receive

        4 is the maximum flow control reset threshold reached during send

        0 bytes is send buffer value during last reset

        0 bytes is send buffer hiwat mark during last reset

        79 times the receive window was closed

        44 dropped due to flowcontrol

        188382441 segments sent using TSO

        4595103991390 bytes sent using TSO

        73883767 TSO segments truncated

        1069 TSO wrapped sequence space segments

        0 segments sent using TSO6

        0 bytes sent using TSO6

        0 TSO6 segments truncated

        0 TSO6 wrapped sequence space segments

        366670238 recv upcalls batched in HP

        302647105 recv upcalls made in HP

        296877004 recv upcalls made in HP because of PSH

        2291336 recv upcalls made in HP because of sb_hiwat

        3481239 recv upcalls made in HP because of both PSH and sb_hiwat

        6733214 recv upcall batch timeouts

        16594187 times recv upcall read partial sb_cc in HP

        631681762 segments received using LRO

        4816721023400 bytes received using LRO

        0 segments received using LRO6

        0 bytes received using LRO6

---- ANYVSERVER IPSpace ----

tcp:

        0 packets sent

                0 data packets (0 bytes)

                0 data packets (0 bytes) retransmitted

                0 data packets unnecessarily retransmitted

                0 resends initiated by MTU discovery

                0 ack-only packets (0 delayed)

                0 URG only packets

                0 window probe packets

                0 window update packets

                0 control packets

        0 packets received

                0 acks (for 0 bytes)

                0 duplicate acks

                0 acks for unsent data

                0 packets (0 bytes) received in-sequence

                0 completely duplicate packets (0 bytes)

                0 old duplicate packets

                0 packets with some dup. data (0 bytes duped)

                0 out-of-order packets (0 bytes)

                0 packets (0 bytes) of data after window

                0 window probes

                0 window update packets

                0 packets received after close

                0 discarded for bad checksums

                0 discarded for bad header offset fields

                0 discarded because packet too short

                0 discarded due to memory problems

        0 connection requests

        0 connection accepts

        0 bad connection attempts

        0 listen queue overflows

        0 ignored RSTs in the windows

        0 connections established (including accepts)

        7 connections closed (including 0 drops)

                0 connections updated cached RTT on close

                0 connections updated cached RTT variance on close

                0 connections updated cached ssthresh on close

        0 embryonic connections dropped

        0 segments updated rtt (of 0 attempts)

        0 retransmit timeouts

                0 connections dropped by rexmit timeout

        0 persist timeouts

                0 connections dropped by persist timeout

        0 Connections (fin_wait_2) dropped because of timeout

        0 keepalive timeouts

                0 keepalive probes sent

                0 connections dropped by keepalive

        0 correct ACK header predictions

        0 correct data packet header predictions

        0 syncache entries added

                0 retransmitted

                0 dupsyn

                0 dropped

                0 completed

                0 bucket overflow

                0 cache overflow

                0 reset

                0 stale

                0 aborted

                0 badack

                0 unreach

                0 zone failures

        0 cookies sent

        0 cookies received

        0 hostcache entries added

                0 bucket overflow

        0 SACK recovery episodes

        0 segment rexmits in SACK recovery episodes

        0 byte rexmits in SACK recovery episodes

        0 SACK options (SACK blocks) received

        0 SACK options (SACK blocks) sent

        0 SACK scoreboard overflow

        0 packets with ECN CE bit set

        0 packets with ECN ECT(0) bit set

        0 packets with ECN ECT(1) bit set

        0 successful ECN handshakes

        0 times ECN reduced the congestion window

        0 times in ONTAP flow control

        0 times exited ONTAP flow control

        0 times in ONTAP flow control for zero send window

        0 times in ONTAP flow control for non-zero send window

        0 connection resets due to ONTAP extreme flow control

        0 times in ONTAP extreme flow control

        0 is the maximum flow control reset threshold reached during receive

        0 is the maximum flow control reset threshold reached during send

        0 bytes is send buffer value during last reset

        0 bytes is send buffer hiwat mark during last reset

        0 times the receive window was closed

        0 dropped due to flowcontrol

        0 segments sent using TSO

        0 bytes sent using TSO

        0 TSO segments truncated

        0 TSO wrapped sequence space segments

        0 segments sent using TSO6

        0 bytes sent using TSO6

        0 TSO6 segments truncated

        0 TSO6 wrapped sequence space segments

        0 recv upcalls batched in HP

        0 recv upcalls made in HP

        0 recv upcalls made in HP because of PSH

        0 recv upcalls made in HP because of sb_hiwat

        0 recv upcalls made in HP because of both PSH and sb_hiwat

        0 recv upcall batch timeouts

        0 times recv upcall read partial sb_cc in HP

        0 segments received using LRO

        0 bytes received using LRO

        0 segments received using LRO6

        0 bytes received using LRO6

---- Cluster IPSpace ----

tcp:

        350960787 packets sent

                253625385 data packets (2042642509989 bytes)

                11525 data packets (120517203 bytes) retransmitted

                63 data packets unnecessarily retransmitted

                0 resends initiated by MTU discovery

                38550609 ack-only packets (15348627 delayed)

                0 URG only packets

                1 window probe packet

                56728197 window update packets

                2035396 control packets

        341097715 packets received

                224460892 acks (for 2042726150883 bytes)

                6840725 duplicate acks

                0 acks for unsent data

                271870811 packets (3031038679110 bytes) received in-sequence

                195650 completely duplicate packets (4506 bytes)

                49 old duplicate packets

                0 packets with some dup. data (0 bytes duped)

                205398 out-of-order packets (565766073 bytes)

                0 packets (0 bytes) of data after window

                0 window probes

                2011210 window update packets

                123 packets received after close

                0 discarded for bad checksums

                0 discarded for bad header offset fields

                0 discarded because packet too short

                0 discarded due to memory problems

        923539 connection requests

        456892 connection accepts

        0 bad connection attempts

        0 listen queue overflows

        529 ignored RSTs in the windows

        1271558 connections established (including accepts)

        1379180 connections closed (including 1101 drops)

                369895 connections updated cached RTT on close

                370750 connections updated cached RTT variance on close

                12122 connections updated cached ssthresh on close

        108207 embryonic connections dropped

        224454663 segments updated rtt (of 207849890 attempts)

        48471 retransmit timeouts

                14 connections dropped by rexmit timeout

        1 persist timeout

                0 connections dropped by persist timeout

        0 Connections (fin_wait_2) dropped because of timeout

        152128 keepalive timeouts

                147328 keepalive probes sent

                4800 connections dropped by keepalive

        45057764 correct ACK header predictions

        104981779 correct data packet header predictions

        457057 syncache entries added

                61 retransmitted

                0 dupsyn

                0 dropped

                456892 completed

                0 bucket overflow

                0 cache overflow

                165 reset

                0 stale

                0 aborted

                0 badack

                0 unreach

                0 zone failures

        457057 cookies sent

        0 cookies received

        61 hostcache entries added

                0 bucket overflow

        1684 SACK recovery episodes

        2491 segment rexmits in SACK recovery episodes

        5618157 byte rexmits in SACK recovery episodes

        17518 SACK options (SACK blocks) received

        86946 SACK options (SACK blocks) sent

        0 SACK scoreboard overflow

        0 packets with ECN CE bit set

        0 packets with ECN ECT(0) bit set

        0 packets with ECN ECT(1) bit set

        0 successful ECN handshakes

        0 times ECN reduced the congestion window

        0 times in ONTAP flow control

        0 times exited ONTAP flow control

        0 times in ONTAP flow control for zero send window

        0 times in ONTAP flow control for non-zero send window

        0 connection resets due to ONTAP extreme flow control

        0 times in ONTAP extreme flow control

        0 is the maximum flow control reset threshold reached during receive

        0 is the maximum flow control reset threshold reached during send

        0 bytes is send buffer value during last reset

        0 bytes is send buffer hiwat mark during last reset

        0 times the receive window was closed

        0 dropped due to flowcontrol

        56607835 segments sent using TSO

        1679494142753 bytes sent using TSO

        36473474 TSO segments truncated

        394 TSO wrapped sequence space segments

        0 segments sent using TSO6

        0 bytes sent using TSO6

        0 TSO6 segments truncated

        0 TSO6 wrapped sequence space segments

        4879278 recv upcalls batched in HP

        90401291 recv upcalls made in HP

        90401967 recv upcalls made in HP because of PSH

        52 recv upcalls made in HP because of sb_hiwat

        325 recv upcalls made in HP because of both PSH and sb_hiwat

        32882 recv upcall batch timeouts

        524 times recv upcall read partial sb_cc in HP

        160827213 segments received using LRO

        2789346524807 bytes received using LRO

        0 segments received using LRO6

        0 bytes received using LRO6

---- ips_4294967289 IPSpace ----

tcp:

        0 packets sent

                0 data packets (0 bytes)

                0 data packets (0 bytes) retransmitted

                0 data packets unnecessarily retransmitted

                0 resends initiated by MTU discovery

                0 ack-only packets (0 delayed)

                0 URG only packets

                0 window probe packets

                0 window update packets

                0 control packets

        0 packets received

                0 acks (for 0 bytes)

                0 duplicate acks

                0 acks for unsent data

                0 packets (0 bytes) received in-sequence

                0 completely duplicate packets (0 bytes)

                0 old duplicate packets

                0 packets with some dup. data (0 bytes duped)

                0 out-of-order packets (0 bytes)

                0 packets (0 bytes) of data after window

                0 window probes

                0 window update packets

                0 packets received after close

                0 discarded for bad checksums

                0 discarded for bad header offset fields

                0 discarded because packet too short

                0 discarded due to memory problems

        0 connection requests

        0 connection accepts

        0 bad connection attempts

        0 listen queue overflows

        0 ignored RSTs in the windows

        0 connections established (including accepts)

        0 connections closed (including 0 drops)

                0 connections updated cached RTT on close

                0 connections updated cached RTT variance on close

                0 connections updated cached ssthresh on close

        0 embryonic connections dropped

        0 segments updated rtt (of 0 attempts)

        0 retransmit timeouts

                0 connections dropped by rexmit timeout

        0 persist timeouts

                0 connections dropped by persist timeout

        0 Connections (fin_wait_2) dropped because of timeout

        0 keepalive timeouts

                0 keepalive probes sent

                0 connections dropped by keepalive

        0 correct ACK header predictions

        0 correct data packet header predictions

        0 syncache entries added

                0 retransmitted

                0 dupsyn

                0 dropped

                0 completed

                0 bucket overflow

                0 cache overflow

                0 reset

                0 stale

                0 aborted

                0 badack

                0 unreach

                0 zone failures

        0 cookies sent

        0 cookies received

        0 hostcache entries added

                0 bucket overflow

        0 SACK recovery episodes

        0 segment rexmits in SACK recovery episodes

        0 byte rexmits in SACK recovery episodes

        0 SACK options (SACK blocks) received

        0 SACK options (SACK blocks) sent

        0 SACK scoreboard overflow

        0 packets with ECN CE bit set

        0 packets with ECN ECT(0) bit set

        0 packets with ECN ECT(1) bit set

        0 successful ECN handshakes

        0 times ECN reduced the congestion window

        0 times in ONTAP flow control

        0 times exited ONTAP flow control

        0 times in ONTAP flow control for zero send window

        0 times in ONTAP flow control for non-zero send window

        0 connection resets due to ONTAP extreme flow control

        0 times in ONTAP extreme flow control

        0 is the maximum flow control reset threshold reached during receive

        0 is the maximum flow control reset threshold reached during send

        0 bytes is send buffer value during last reset

        0 bytes is send buffer hiwat mark during last reset

        0 times the receive window was closed

        0 dropped due to flowcontrol

        0 segments sent using TSO

        0 bytes sent using TSO

        0 TSO segments truncated

        0 TSO wrapped sequence space segments

        0 segments sent using TSO6

        0 bytes sent using TSO6

        0 TSO6 segments truncated

        0 TSO6 wrapped sequence space segments

        0 recv upcalls batched in HP

        0 recv upcalls made in HP

        0 recv upcalls made in HP because of PSH

        0 recv upcalls made in HP because of sb_hiwat

        0 recv upcalls made in HP because of both PSH and sb_hiwat

        0 recv upcall batch timeouts

        0 times recv upcall read partial sb_cc in HP

        0 segments received using LRO

        0 bytes received using LRO

        0 segments received using LRO6

        0 bytes received using LRO6

---- ACP IPSpace ----

tcp:

        86643 packets sent

                17496 data packets (419904 bytes)

                0 data packets (0 bytes) retransmitted

                0 data packets unnecessarily retransmitted

                0 resends initiated by MTU discovery

                33848 ack-only packets (0 delayed)

                0 URG only packets

                0 window probe packets

                23 window update packets

                35276 control packets

        74406 packets received

                51152 acks (for 436064 bytes)

                4798 duplicate acks

                0 acks for unsent data

                20938 packets (1251746 bytes) received in-sequence

                0 completely duplicate packets (0 bytes)

                0 old duplicate packets

                0 packets with some dup. data (0 bytes duped)

                0 out-of-order packets (0 bytes)

                0 packets (0 bytes) of data after window

                0 window probes

                0 window update packets

                1686 packets received after close

                0 discarded for bad checksums

                0 discarded for bad header offset fields

                0 discarded because packet too short

                0 discarded due to memory problems

        17605 connection requests

        176 connection accepts

        0 bad connection attempts

        0 listen queue overflows

        0 ignored RSTs in the windows

        17672 connections established (including accepts)

        17781 connections closed (including 2 drops)

                0 connections updated cached RTT on close

                0 connections updated cached RTT variance on close

                0 connections updated cached ssthresh on close

        0 embryonic connections dropped

        51152 segments updated rtt (of 52750 attempts)

        109 retransmit timeouts

                0 connections dropped by rexmit timeout

        0 persist timeouts

                0 connections dropped by persist timeout

        0 Connections (fin_wait_2) dropped because of timeout

        0 keepalive timeouts

                0 keepalive probes sent

                0 connections dropped by keepalive

        17474 correct ACK header predictions

        4954 correct data packet header predictions

        176 syncache entries added

                0 retransmitted

                0 dupsyn

                0 dropped

                176 completed

                0 bucket overflow

                0 cache overflow

                0 reset

                0 stale

                0 aborted

                0 badack

                0 unreach

                0 zone failures

        176 cookies sent

        0 cookies received

        0 hostcache entries added

                0 bucket overflow

        0 SACK recovery episodes

        0 segment rexmits in SACK recovery episodes

        0 byte rexmits in SACK recovery episodes

        0 SACK options (SACK blocks) received

        0 SACK options (SACK blocks) sent

        0 SACK scoreboard overflow

        0 packets with ECN CE bit set

        0 packets with ECN ECT(0) bit set

        0 packets with ECN ECT(1) bit set

        0 successful ECN handshakes

        0 times ECN reduced the congestion window

        0 times in ONTAP flow control

        0 times exited ONTAP flow control

        0 times in ONTAP flow control for zero send window

        0 times in ONTAP flow control for non-zero send window

        0 connection resets due to ONTAP extreme flow control

        0 times in ONTAP extreme flow control

        0 is the maximum flow control reset threshold reached during receive

        0 is the maximum flow control reset threshold reached during send

        0 bytes is send buffer value during last reset

        0 bytes is send buffer hiwat mark during last reset

        0 times the receive window was closed

        0 dropped due to flowcontrol

        0 segments sent using TSO

        0 bytes sent using TSO

        0 TSO segments truncated

        0 TSO wrapped sequence space segments

        0 segments sent using TSO6

        0 bytes sent using TSO6

        0 TSO6 segments truncated

        0 TSO6 wrapped sequence space segments

        0 recv upcalls batched in HP

        0 recv upcalls made in HP

        0 recv upcalls made in HP because of PSH

        0 recv upcalls made in HP because of sb_hiwat

        0 recv upcalls made in HP because of both PSH and sb_hiwat

        0 recv upcall batch timeouts

        0 times recv upcall read partial sb_cc in HP

        0 segments received using LRO

        0 bytes received using LRO

        0 segments received using LRO6

        0 bytes received using LRO6
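For anyone wanting to compare these counters between nodes or over time, here is a minimal Python sketch that pulls a few of them out of the text and turns them into ratios. It assumes the line wording is exactly as in the output pasted above; the function names and the choice of counters are my own, not anything from ONTAP, and the sample is a shortened copy of the Default IPSpace section.

```python
import re

# Counters of interest, keyed by a short name of my own choosing.
# The patterns assume the exact line wording of the pasted output.
PATTERNS = {
    "data_sent": re.compile(r"^(\d+) data packets \(\d+ bytes\)$"),
    "data_rexmit": re.compile(r"^(\d+) data packets \(\d+ bytes\) retransmitted$"),
    "in_sequence": re.compile(r"^(\d+) packets \(\d+ bytes\) received in-sequence$"),
    "out_of_order": re.compile(r"^(\d+) out-of-order packets \(\d+ bytes\)$"),
    "sack_sent": re.compile(r"^(\d+) SACK options \(SACK blocks\) sent$"),
    "flow_control": re.compile(r"^(\d+) times in ONTAP flow control$"),
}
SECTION = re.compile(r"^-+ (.+) IPSpace -+$")


def parse_netstat_tcp(text):
    """Return {ipspace: {counter_name: value}} for the counters above."""
    stats = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        section = SECTION.match(line)
        if section:
            current = stats.setdefault(section.group(1), {})
            continue
        if current is None:
            continue
        for name, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                current[name] = int(match.group(1))
                break
    return stats


def ratio(numerator, denominator):
    """Safe division so idle IPspaces (all zeros) do not raise."""
    return numerator / denominator if denominator else 0.0


# Shortened copy of the Default IPSpace section from the output above.
SAMPLE = """\
---- Default IPSpace ----
tcp:
        900103907 packets sent
                476280230 data packets (4676048494764 bytes)
                61984 data packets (82328048 bytes) retransmitted
                723809875 packets (4886339861169 bytes) received in-sequence
                10872239 out-of-order packets (15192422547 bytes)
        11457931 SACK options (SACK blocks) sent
        251543 times in ONTAP flow control
"""

if __name__ == "__main__":
    for ipspace, c in parse_netstat_tcp(SAMPLE).items():
        rexmit = 100 * ratio(c.get("data_rexmit", 0), c.get("data_sent", 0))
        ooo = 100 * ratio(c.get("out_of_order", 0), c.get("in_sequence", 0))
        print("%s: %.4f%% data packets retransmitted, %.2f%% out-of-order, "
              "%d SACK blocks sent, %d times in flow control"
              % (ipspace, rexmit, ooo, c.get("sack_sent", 0), c.get("flow_control", 0)))
```

The ratios are only a convenience for spotting which IPspace stands out; what counts as "high" is a judgement call and not something this sketch decides.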

 

 

Server tcp entries

 

[root@jwukccsbci ~]# sysctl -a | grep slot

sunrpc.tcp_slot_table_entries = 128

sunrpc.udp_slot_table_entries = 128

dev.cdrom.info = drive # of slots:      1

 

From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 24 January 2018 11:53
To: Mark Saunders; Parisi, Justin; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

Could this be TCP slot tables? Flow control capabilities on ONTAP continue to improve. If you don't have TCP slot tables capped at 128 you could see quasi-hangs like this.

 

Complete details are in TR-3633, but these are the two that you want to watch:

 

[root@stlrx300s7-145 mkdb]# sysctl -a | grep slot

sunrpc.tcp_max_slot_table_entries = 128

sunrpc.tcp_slot_table_entries = 128

 

Newer versions of Linux will allow a ridiculous number of unacknowledged RPC operations to build up. The result can be sending ONTAP into a flow control mode until the OS catches up. We see problems mostly with slow clients. For example, if you're trying to read a lot of data from a host with 1Gb connectivity on a high-end ONTAP system, the OS can ask for data quicker than it can process the responses.
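If you have a lot of clients to check, something like the following Python sketch could be run over saved `sysctl -a` output to flag any slot table setting above 128. The function names are made up for illustration, and the second sample is a hypothetical uncapped client, not output from any host in this thread.

```python
import re

RECOMMENDED_CAP = 128  # the cap suggested above; see TR-3633 for details

SLOT_LINE = re.compile(r"^\s*(sunrpc\.\w*slot_table_entries)\s*=\s*(\d+)\s*$")


def parse_slot_settings(sysctl_text):
    """Return {setting_name: value} for sunrpc slot table lines only."""
    settings = {}
    for line in sysctl_text.splitlines():
        match = SLOT_LINE.match(line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


def uncapped(settings, cap=RECOMMENDED_CAP):
    """Names of settings whose value exceeds the cap, sorted."""
    return sorted(name for name, value in settings.items() if value > cap)


# Sample 1: the two tunables shown above.
CAPPED = """\
sunrpc.tcp_max_slot_table_entries = 128
sunrpc.tcp_slot_table_entries = 128
"""

# Sample 2: HYPOTHETICAL values for an untuned client (illustration only).
UNTUNED = """\
sunrpc.tcp_max_slot_table_entries = 65536
sunrpc.tcp_slot_table_entries = 2
dev.cdrom.info = drive # of slots:      1
"""

if __name__ == "__main__":
    for label, text in (("capped", CAPPED), ("untuned", UNTUNED)):
        over = uncapped(parse_slot_settings(text))
        print(label, "->", over if over else "all slot tables at or below cap")
```

Unrelated lines that happen to contain "slot" (the cdrom one, for example) are ignored because only `sunrpc.*slot_table_entries` names are matched.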

 

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Wednesday, January 24, 2018 12:26 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

Justin

 

I have just checked the SVM and there are no admin/management interfaces configured for it; there are three data LIFs for different VLANs. I have checked through our other systems this morning and there are no issues in VMware (5.5) or SLES 11/12, so this is just with the Red Hat servers.

 

I have checked the interfaces at the server end and they are not showing errors or dropped packets. On the filer end we have 4 physical ports in an interface group with VLANs on top. I have run “statistics start -object nfs_exports_access_cache” which, when checked, doesn’t report any errors.

 

On the server interface

 

eth1      Link encap:Ethernet  HWaddr 00:50:56:A5:0D:6A

          inet addr:10.240.1.30  Bcast:10.240.1.31  Mask:255.255.255.224

          inet6 addr: fe80::250:56ff:fea5:d6a/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:127209 errors:0 dropped:0 overruns:0 frame:0

          TX packets:26100 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:1000

          RX bytes:104158360 (99.3 MiB)  TX bytes:14489402 (13.8 MiB)

 

 

 

While investigating, we have found that the file systems are fine just after a reboot and you can ls each mount, so they are all initially OK. It is when we start the application, and so put a bigger load over the network, that the file systems stop responding.

 

 

Regards

 

Mark

 

From: Parisi, Justin [mailto:Justin.Parisi@netapp.com]
Sent: 23 January 2018 22:33
To: Steiner, Jeffrey; Mark Saunders; Fenn, Michael; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

In fact, maybe look at this as a root cause… do your NFS interfaces share nodes with admin interfaces?

 

“NFS issues were caused by using a NAS interface on the same node as the SVM admin interface, once I realised we moved all servers NFS to the node without the admin interface.”

 

From: Parisi, Justin
Sent: Tuesday, January 23, 2018 5:30 PM
To: Parisi, Justin <Justin.Parisi@netapp.com>; Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

This community post also does a good job explaining it:

 

https://community.netapp.com/t5/Data-ONTAP-Discussions/NetApp-Ontap-9-2-Upgrade-review-your-network-first/td-p/136657

 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Parisi, Justin
Sent: Tuesday, January 23, 2018 5:28 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

The network stack changed in 9.2 and IP fastpath was removed. But fastpath was mainly for more efficient routing.

 

https://library.netapp.com/ecmdocs/ECMP1114171/html/GUID-8276014A-16EB-4902-9EDC-868C5292381B.html

 

The stack was changed to a more standard BSD stack, so fastpath was no longer needed. It’s possible that’s an issue here, but I’d suggest getting network sniffs on each endpoint of the network to see where the packet is being dropped.

 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey
Sent: Tuesday, January 23, 2018 5:24 PM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

I should have asked - is this SAP HANA or something like SAP on an Oracle database?

 

Also, what do they mean "it's not on the IMT?" Virtually everything NFS is on the IMT. We support any NFSv3 and NFSv4 client that obeys the specification. There's a tiny number of exceptions, but generally speaking we'll support Linux, Solaris, AIX, mainframe, OpenVMS, HP-UX, Oracle DNFS, AS/400, etc. There really should be no issue there.

 

The thing about fastpath does ring a few bells.

 

From: Mark Saunders [mailto:Mark.Saunders@pcmsgroup.com]
Sent: Tuesday, January 23, 2018 11:18 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Fenn, Michael <fennm@DEShawResearch.com>; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

Thanks for the quick replies. Sorry for the delay in responding, but I had been working on this since 5am so had to go and sleep.

 

I have a call open with NetApp but have had the cookie-cutter response that it isn’t on the Interoperability Matrix Tool as a supported version (it wasn’t when we were on 9.1 anyway).

 

A third party we are in contact with has sent me a link to details about fastpath being removed, but I don’t think we were using it, so that may be another false lead.

 

The mount options were kept fairly straightforward:

 

nfs nolock,_netdev,udp 0 0

 

We have also tried the same options as one of the production servers, which had tuned options. That server is on another cluster, so it isn’t affected by this yet.

 

nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600 0 0
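For anyone comparing the two option strings above, here is a minimal Python sketch that splits an fstab options field and reports the transport and the base timeout. The function name `parse_nfs_options` is made up for illustration, and the sketch only looks for the bare `udp`/`tcp` keywords (not `proto=`). The one fact it relies on is that `timeo` is expressed in tenths of a second.

```python
def parse_nfs_options(field):
    """Split a comma-separated fstab options field into a dict."""
    opts = {}
    for item in field.split(","):
        key, sep, value = item.partition("=")
        # Flags such as "nolock" carry no value, so store True for them.
        opts[key] = value if sep else True
    return opts


# The tuned options quoted above (without the trailing "0 0" dump/pass fields).
tuned = parse_nfs_options(
    "nfsvers=3,nolock,_netdev,rw,udp,rsize=32768,wsize=32768,timeo=600"
)

transport = "udp" if "udp" in tuned else "tcp" if "tcp" in tuned else "client default"
timeo_seconds = int(tuned["timeo"]) / 10  # timeo is given in tenths of a second

print(transport, timeo_seconds)
```

With the tuned string this reports UDP transport and a 60-second base timeout. The simpler option set has no `timeo`, so the client's default would apply there.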

 

How would I be able to tell if we are using DNFS?

 

I will send you the support details tomorrow when I am back in the office.

 

Regards

 

Mark

 

 

From: Steiner, Jeffrey [mailto:Jeffrey.Steiner@netapp.com]
Sent: 23 January 2018 17:29
To: Fenn, Michael; Mark Saunders; toasters@teaparty.net
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

It takes a lot for an ONTAP system to flat-out be unable to respond. Unless the timeout parameters are exceedingly short, you shouldn't reach that point, especially with anything capable of running ONTAP 9.2.

 

I'd open a support case on this one. In addition, if you want to trigger an autosupport and send me the serial numbers directly I can take a glance at a few stats to see if anything looks odd.

 

From: Fenn, Michael [mailto:fennm@DEShawResearch.com]
Sent: Tuesday, January 23, 2018 6:23 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; Mark Saunders <Mark.Saunders@pcmsgroup.com>; toasters@teaparty.net
Subject: Re: NFS issue after upgrading filers to 9.2P2

 

The messages are not necessarily indicative of a network problem.

 

The kernel prints "nfs: server … not responding, still trying" when an operation times out (timeo deciseconds) for the configured (retrans) number of tries.  Once the server responds, then it prints "nfs: server … OK". 

 

Networking problems are certainly one reason that an operation would time out, but not the only reason.  An overloaded or down file server will cause the same effect.

 

Thanks,

Michael

 

From: <toasters-bounces@teaparty.net> on behalf of "Steiner, Jeffrey" <Jeffrey.Steiner@netapp.com>
Date: Tuesday, January 23, 2018 at 10:38 AM
To: Mark Saunders <Mark.Saunders@pcmsgroup.com>, "toasters@teaparty.net" <toasters@teaparty.net>
Subject: RE: NFS issue after upgrading filers to 9.2P2

 

Those messages are indicative of a network problem. The packets are going through, then they succeed when the NFS client retries, then they pause again.

 

I can't think why an ONTAP upgrade of this type would cause such a problem. If it was working before, it should be working now. If you had any kind of a locking, firewall, or general configuration problem you should have no access whatsoever.

 

I've seen some weird NFS bugs in SUSE, but that RHEL version should be fine.

 

What are the mount options used, and are you using DNFS?

 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Mark Saunders
Sent: Tuesday, January 23, 2018 4:29 PM
To: toasters@teaparty.net
Subject: NFS issue after upgrading filers to 9.2P2

 

Hi gents, today we upgraded our Coventry cluster from 9.1P6 to 9.2P2 and we are about 99% working; we just have a strange issue with the NFS mounts on our SAP database servers. When a server is bounced the mounts attach with no problems, but after a few minutes a df -h becomes very slow to report the NFS-mounted directories, and if the databases are started up they hang and a df -h then also hangs. This sometimes recovers enough to allow a df -h to work again, but the databases are a lost cause right now.

 

In the server messages we get lots of entries for the SVM:

 

Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying

Jan 23 07:01:47 jwukccsbci last message repeated 5 times

Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK

Jan 23 07:02:07 jwukccsbci last message repeated 5 times

Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying

Jan 23 07:02:47 jwukccsbci last message repeated 5 times

Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
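To put numbers on how long each stall lasts, a small Python sketch along these lines could pair every "not responding" entry with the next "OK" for the same server. The function name `stall_durations` and the hard-coded year are assumptions for illustration (syslog timestamps carry no year), and the "last message repeated" lines are simply skipped.

```python
import re
from datetime import datetime

# The kernel messages quoted above, minus the "last message repeated" lines.
LOG = """\
Jan 23 07:01:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:07 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
Jan 23 07:02:27 jwukccsbci kernel: nfs: server JWUKCSVM01 not responding, still trying
Jan 23 07:02:48 jwukccsbci kernel: nfs: server JWUKCSVM01 OK
"""

PATTERN = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: nfs: server (\S+) (not responding|OK)"
)


def stall_durations(text, year=2018):
    """Return (server, seconds) for each 'not responding' -> 'OK' pair."""
    started = {}
    stalls = []
    for line in text.splitlines():
        m = PATTERN.match(line.strip())
        if not m:
            continue  # blank lines and "last message repeated" entries
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        server, state = m.group(2), m.group(3)
        if state == "not responding":
            started.setdefault(server, stamp)  # keep the first stall time
        elif server in started:
            delta = stamp - started.pop(server)
            stalls.append((server, int(delta.total_seconds())))
    return stalls


print(stall_durations(LOG))
```

For the excerpt above this gives stalls of 40 and 21 seconds. Run against the full messages file, it would show whether the stalls are regular or getting longer.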

 

Is there anything that would have changed in the upgrade to lock down NFS, or any changed options that we might need to change back?

 

The Red Hat servers run an old kernel version, 2.6.18-371.el5, that has some bugs, but this was working fine before the filer upgrade was carried out.

 

 

Regards

Mark

Data Centre Sysadmin Team

Managed Services

Phone:- 02476 694455 Ext 2567
