NFS fails?


Mark Flint

28 Mar 2014

11:22 a.m.

Hi all, I have an issue with a FAS3170 and NFS. Seems that hosts are being disconnected and the only thing I can find is a strange message that would seem to point at TOE? :-

Tue Mar 4 21:13:49 GMT [netapp5a: nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (172.17.82.34) where transmit side flow control has been enabled. There are 22 outstanding replies queued on the transmit buffer

However, TOE is turned off. If someone could point me in the right direction, it would be much appreciated :)

~Mark

-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.


Steiner, Jeffrey

28 Mar

11:41 a.m.

That means that an NFS client ceased responding to inbound network traffic for a long time, and eventually ONTAP closed the transmit buffers down entirely.

What is the workload on your end servers? The only time I've seen this issue occur, other than an actual total failure of network connectivity, is with some Oracle DNFS bugs.
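For anyone triaging the same warning, here is a minimal sketch that tallies how often each client is hit and how many replies were queued. The regular expression is inferred only from the two sample lines quoted in this thread, so other ONTAP releases may word the message differently; the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the two sample warnings in this thread only.
# Assumption: other ONTAP releases may format the message differently.
IDLE_CLOSE = re.compile(
    r"\[(?P<filer>[^:\]]+): nfsd\.tcp\.close\.idle\.notify:warning\]: "
    r"Shutting down idle connection to client \((?P<client>[^)]+)\).*?"
    r"There are (?P<queued>\d+) outstanding replies"
)


def summarize(lines):
    """Return (closes per client, largest queued-reply count per client)."""
    closes = Counter()
    worst = {}
    for line in lines:
        match = IDLE_CLOSE.search(line)
        if match is None:
            continue  # not one of the idle-close warnings
        client = match.group("client")
        queued = int(match.group("queued"))
        closes[client] += 1
        worst[client] = max(worst.get(client, 0), queued)
    return closes, worst


# The two warnings quoted in this thread, used as sample input.
sample = [
    "Tue Mar 4 21:13:49 GMT [netapp5a: nfsd.tcp.close.idle.notify:warning]: "
    "Shutting down idle connection to client (172.17.82.34) where transmit "
    "side flow control has been enabled. There are 22 outstanding replies "
    "queued on the transmit buffer",
    "Mon Mar 21 12:06:33 GMT [Filer1: nfsd.tcp.close.idle.notify:warning]: "
    "Shutting down idle connection to client (x.x.x.x) where transmit side "
    "flow control has been enabled. There are 131 outstanding replies queued "
    "on the transmit buffer. This socket is being closed from the deferred queue.",
]

closes, worst = summarize(sample)
for client, count in closes.most_common():
    print(client, count, worst[client])
```

If the warnings cluster on a handful of client addresses, that points at those hosts (or their switch ports) rather than at the filer as a whole.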


Mark Flint

12:06 p.m.

It’s a large processing farm, using LSF. I’m not seeing those disconnects anywhere else on the storage system, just from the LSF farm. Not sure of the workload at the time it started; I’ll ask the HPC guys if they can point me at some info.

Mark Flint mf1@sanger.ac.uk


Martin

1:05 p.m.

Interesting thread. I've got a similar situation with a 3140 running 7.3.6P2 connected to an Oracle host over 1GbE using NFS, which is showing spikes in latency on the host. The Oracle host is showing dropped packets on its storage interface and I am seeing lots of messages logged like:

Mon Mar 21 12:06:33 GMT [Filer1: nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (x.x.x.x) where transmit side flow control has been enabled. There are 131 outstanding replies queued on the transmit buffer. This socket is being closed from the deferred queue.

My thought was that the Oracle host's interface is saturated and it's not responding to the NFS acknowledgements in time, so the NetApp is dropping the NFS requests.

The 1GbE interface is being upgraded on the Oracle host, but one of my concerns is hitting bugs that have been fixed in later 8.1.x releases once we remove the bottleneck on the Oracle host, particularly the DNFS and load-related bugs. I then read your comment:

"The only time I’ve seen this issue occur, other than an actual total failure of network connectivity, is with some Oracle DNFS bugs."

Is it possible to confirm whether this is simply the Filer flushing unacknowledged NFS requests or if this is actually the DNFS bug?

-- View this message in context: http://network-appliance-toasters.10978.n7.nabble.com/NFS-fails-tp25611p2561... Sent from the Network Appliance - Toasters mailing list archive at Nabble.com.

Steiner, Jeffrey

1:15 p.m.

That message guarantees that NFS is flushing unacknowledged NFS operations; the only question is why. If you're not having frequent power failures of your database servers (I hope that's a safe assumption!) then you're almost certainly hitting the known DNFS issue.

I strongly recommend getting to 11.2.0.4 if you're using DNFS. DNFS has had a deadlock issue where you'll see these nfsd.tcp.close.idle warnings frequently, usually with stalls in IO that can last a couple of minutes. I can't think of any risk in upgrading to 10Gb. In addition, I would recommend patching ONTAP up to 7.3.7P2 in order to get an ONTAP patch related to NFS flow control.

If all you're seeing is latency spikes, that's probably a different issue. These NFS flow control messages are usually associated with total hangs that last up to 2 minutes, although not usually that bad.

Don't let this scare you away from DNFS, though. The bugs in question existed for many years and nobody noticed until recently. They're extremely rare.


_______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters

Mark Flint

2:52 p.m.

We’re not using DNFS... or PNFS :)

Mark Flint mf1@sanger.ac.uk


tmac

3:01 p.m.

It should be nearly painless to turn off flow control (both send and receive) on the filers and the switches. If flow control gets triggered on the switch, it could (and likely will) propagate to the NetApp, and the interface will basically stop until the switch tells it to start again.

I know there may be other bugs out there, but (if you are using 10 GigE) it is certainly worth a shot to turn off all flow control. It is the current best practice.

--tmac

*Tim McCarthy* *Principal Consultant*


tmac

11:52 a.m.

Check your switch ports and your NetApp ports.

Make SURE that Flow Control is OFF

Check on the Filer: ifstat <port>

On the switch: show int <port>

You should see NONE. If it is not none, then flow control is on, so find a way to turn it off.

--tmac

*Tim McCarthy* *Principal Consultant*


Steiner, Jeffrey

12:11 p.m.

This isn't Ethernet flow control, this is NFS flow control. Ordinarily, those are two totally different things; HOWEVER, there is a link in some cases.

Ethernet flow control is just a mechanism where a receiver can tell a sender to cease transmission. We strongly discourage use of Ethernet flow control. The problem is that Ethernet flow control works at the link level, so a pause applies to the whole port rather than to a single connection. If you have a lot of clients attached to one filer and one of those clients gets into trouble and starts sending flow control requests, all transmission on the filer NIC stops. The end result is that your filer is only as fast as your slowest client. The problem is especially pronounced with gigabit hosts and 10Gb filers. The filer is capable of sending data much faster than a client can receive it, and you can run into lots of flow control activity. Most users will never see a problem, but some will. It's better to disable flow control and let the clients just drop packets. TCP/IP stacks are designed to deal with packet loss.
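The "only as fast as your slowest client" effect can be illustrated with a toy model. This is a sketch under loud assumptions, not a model of real 802.3x timing: one NIC alternates frames between one fast and one slow client, the slow client drains one frame every few time slots from a small buffer, and every name and number here (`simulate`, `slow_drain_every`, `slow_buffer`) is hypothetical.

```python
def simulate(slots, pause_enabled, slow_drain_every=4, slow_buffer=2):
    """Alternate one NIC between a fast client and a slow client.

    Returns (frames delivered to the fast client,
             frames delivered to the slow client,
             frames dropped at the slow client).
    All parameters are illustrative, not measured values.
    """
    fast_delivered = slow_delivered = dropped = 0
    buffered = 0  # frames waiting in the slow client's receive buffer
    turn = 0      # 0 -> fast client's turn, 1 -> slow client's turn
    for slot in range(slots):
        # The slow client consumes one buffered frame every few slots.
        if slot % slow_drain_every == 0 and buffered > 0:
            buffered -= 1
            slow_delivered += 1
        # With flow control on, a full buffer pauses the whole port.
        if pause_enabled and buffered >= slow_buffer:
            continue
        if turn == 0:
            fast_delivered += 1
        elif buffered < slow_buffer:
            buffered += 1
        else:
            dropped += 1  # no flow control: frame is lost, TCP would resend
        turn ^= 1
    return fast_delivered, slow_delivered, dropped


with_pause = simulate(1000, pause_enabled=True)
without_pause = simulate(1000, pause_enabled=False)
print("pause on :", with_pause)
print("pause off:", without_pause)
```

In this toy setup the fast client receives noticeably fewer frames when pausing is enabled, because the port idles whenever the slow client's buffer is full. With pausing disabled, the slow client drops some frames but the fast client keeps its full share of the link.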

NFS flow control helps an NFS server protect itself from a malfunctioning client. If an NFS client keeps asking for data but never acknowledges receipt, the output TCP buffers on the NFS server would fill up. NFS flow control kicks in to stop this. The NFS server will stop transmitting data if there are too many unacknowledged operations.

The message in question means an NFS client went a full 2 minutes without acknowledging any transmissions. If an NFS client lost power, you're guaranteed to see this message when ONTAP gives up waiting and just clears the remaining data in the buffer.
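The behaviour described in the two paragraphs above can be sketched as a small state machine. This is a hedged illustration, not ONTAP's implementation: the `Connection` class, its field names, and the limit of 16 unacknowledged replies are all invented for the example; only the 2-minute timeout comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Toy model of one client's connection as seen by an NFS server."""
    max_unacked: int = 16       # hypothetical limit, NOT an ONTAP value
    idle_timeout_s: int = 120   # the "full 2 minutes" described above
    unacked: int = 0
    last_ack_s: int = 0
    flow_controlled: bool = False
    closed: bool = False
    log: list = field(default_factory=list)

    def send_reply(self):
        """Queue one reply unless flow control is holding the connection."""
        if self.closed:
            return False
        if self.unacked >= self.max_unacked:
            self.flow_controlled = True  # too many unacknowledged replies
            return False
        self.unacked += 1
        return True

    def ack(self, now_s, count=1):
        """Client acknowledged `count` replies; transmission may resume."""
        self.unacked = max(0, self.unacked - count)
        self.last_ack_s = now_s
        if self.unacked < self.max_unacked:
            self.flow_controlled = False

    def tick(self, now_s):
        """Close the connection if it stayed flow controlled and silent."""
        if (not self.closed and self.flow_controlled
                and now_s - self.last_ack_s >= self.idle_timeout_s):
            self.closed = True
            self.log.append(
                f"closing idle connection, {self.unacked} replies queued")
            self.unacked = 0  # remaining buffered data is discarded


# A client that keeps requesting data but never acknowledges anything.
conn = Connection()
for second in range(130):
    conn.send_reply()
    conn.tick(second)
print(conn.closed, conn.log)
```

A client that simply lost power looks exactly like this to the server: replies pile up to the limit, nothing is acknowledged, and after the timeout the connection is closed and the queue is cleared.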

There are, however, a couple of bugs where ethernet flow control can lead to an NFS flow control glitch which can stall IO. I'd recommend opening a case on this to see if you might be affected.

