Here's the description and workaround from NOW for Bug ID 29146. It might help with your problem.
Description:
In Data ONTAP 5.3.4 and earlier releases the default UDP transfer size is 8192. In 5.3.5 and later releases the default is 32768. This larger value improves performance in some situations, but may cause problems in others.

Problem 1 - 100-Mbit client is reading from Gigabit interface

With the larger default, problems may result when the Gigabit interface on the filer is sending to a 100-Mbit client. Switches can drop packets on the outbound 100-Mbit port because too many packets become queued due to the difference in line speeds and the larger transfer size. Few complete 32K UDP datagrams are received by the client, and the client assumes the filer has gone away and increases its delay between retries. Poor performance may be seen.

If a client does not specify the rsize and wsize parameters in the mount command, the filer's default UDP transfer size is used. If a client does specify rsize and wsize, and the values are larger than the default, then the default is used. This means that you may see problems after you upgrade from 5.3.4 (or earlier release) to 5.3.5 (or later release) if the resulting transfer size is 32768.

Problem 2 - Multiple clients are reading from 100-Mbit interface

When the UDP transfer size is 32768 and multiple clients are reading from a 100-Mbit interface, the driver transmit queue on the filer can overflow more readily and packets may be discarded. This can lead to erratic or poor performance on the clients.
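(A quick way to confirm what transfer size a client actually ended up with, separate from the bug write-up itself: on a Solaris or Linux client, nfsstat -m lists each NFS mount with its negotiated options, including the read and write sizes.)

    # lists every NFS mount with its negotiated options (rsize/wsize among them)
    nfsstat -m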
Workaround:
If you have only 100-Mbit clients, the easiest workaround is to change the default UDP transfer size on the filer. First, display the current value:

    options nfs.udp.xfersize

If the value displayed is greater than 8192, then change the value:

    options nfs.udp.xfersize 8192

If you have both Gigabit and 100-Mbit clients, instead of doing the above you may just want to insert "rsize=8192,wsize=8192" into the mount commands on your 100-Mbit clients and set the filer's default transfer size to 32768. By doing this, the Gigabit clients can benefit from the larger transfer size.

If you have only Gigabit interfaces on the filer and the clients, the default UDP transfer size can be set to 32768 to improve performance.

Regardless of the transfer size, if you have Sun clients with Sun Gigabit/2.0 NICs, you should enable transmit flow control on the Sun NIC and ensure that receive flow control is enabled on the switch port. The Sun Gigabit/2.0 NIC drops packets because it can't keep up with the filer. In such situations, enabling flow control can substantially improve performance.

If you are using an FDDI interface in the filer and a UDP transfer size of 32768, you might want to make sure that all FDDI clients attached to the ring can handle the larger burst in traffic due to the increased transfer size.
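For the mixed Gigabit/100-Mbit case, the combination would look roughly like this. The export path and mount point below are only placeholders, and the client line uses Solaris-style mount syntax; adjust for your own clients.

On the filer (keep the larger default so the Gigabit clients benefit):

    options nfs.udp.xfersize 32768

On each 100-Mbit client (cap the transfer size in the mount itself):

    mount -F nfs -o vers=3,proto=udp,rsize=8192,wsize=8192 filer:/vol/vol0/home /mnt/home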
Dave
-----Original Message-----
From: Jeff Bryer [mailto:bryer@sfu.ca]
Sent: Friday, January 24, 2003 12:47 PM
To: Pawel Rogocz
Cc: toasters@mathworks.com
Subject: Re: Filer setting tcp.wsize to 0 -> NFS timeouts
We've experienced the same behaviour on two separate systems: an F740 running 6.1.2R3 connected to a 420R running Solaris 8 with a Sun GigE card, and an F810 running 6.2.1R2 connected to a V880 running Solaris 8 using the onboard GigE port.
We've switched to running NFS v3 over UDP on the F740 (which is the only GigE we have in production on our NetApps). We haven't had any performance issues since. This filer is housing DB/2.
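For reference, a Solaris 8 mount for NFS v3 over UDP looks roughly like this; the hostname, export path, and mount point are placeholders, and the 32K rsize/wsize assumes Gigabit end to end (per the bug note above you would drop to 8K for 100-Mbit clients):

    mount -F nfs -o vers=3,proto=udp,rsize=32768,wsize=32768 filer:/vol/db /mnt/db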
We do have a co-located F740 connected via GigE to a Red Hat Linux machine, but it has only ever been configured to use NFS v2 over UDP with an 8K transfer size (rsize/wsize). There is some interest in switching it over to TCP to see if the window-reduction problem shows up there.
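If that experiment happens, the Red Hat mount line over TCP would look roughly like this, keeping the 8K rsize/wsize and changing only the transport; the hostname, export path, and mount point are placeholders, and it assumes the client kernel supports NFS over TCP:

    mount -t nfs -o nfsvers=2,tcp,rsize=8192,wsize=8192 filer:/vol/vol0/colo /mnt/colo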
On Thu, Jan 23, 2003 at 10:49:49PM -0800, Pawel Rogocz wrote:
I have been having NFS performance problems with our F820 / 6.2.1R2 for quite some time now. The filer has a GigE card, but when we start pushing about 10 MB/sec (50% CPU load) we get numerous NFS timeouts from Solaris clients mounting the filer via TCP. None of the filer stats (statit et al.) show any bottleneck on any of the disks. Looking at network traces from a mirrored port, I see the filer at some point reducing the TCP window size to 0, down from its normal value of 26280. It looks like it is running out of juice. Any ideas what might be causing this?
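One way to isolate that behaviour in the traces, if tcpdump is available on the machine watching the mirrored port, is to filter on the TCP window field so only the zero-window segments from the filer show up (the interface name and the host name "filer" below are placeholders):

    # tcp[14:2] is the TCP window field; this keeps only segments the filer
    # sends from the NFS port advertising a zero receive window
    tcpdump -n -i eth0 'src host filer and tcp src port 2049 and tcp[14:2] = 0'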
Pawel