Hi all,
we've found some odd failure messages on all Linux machines running 2.4.21 here. Those machines report "NFS: server filerX-vif1 not responding, timed out" again and again throughout the day.
All machines are either Dell 2650 or HP ProLiant DL360/DL380. All machines have a "Broadcom Corporation NetXtreme" Gigabit NIC. It seems that the same type of machines with Linux kernel 2.4.18 have no NFS problems.
Has anyone seen the same problems and, if so, is it a kernel problem? Is 2.4.23 more stable with NFS, or should I downgrade everything to 2.4.18? Is anyone from NetApp with Linux customers here? ;-)
Greetings,
I have seen that error on Linux when using NFS over UDP, including NFS servers other than a Netapp. I switched to TCP for all my NFS mounts and haven't had any trouble. Here is my /etc/fstab entry:
filer:/vol/vol0/dir /dir nfs rw,hard,intr,tcp,bg 0 0
I'm running 2.4.20-24.7 (RH 7.3).
Steve Losen scl@virginia.edu phone: 434-924-0640
University of Virginia ITC Unix Support
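To find out which NFS mounts on a client are still on UDP, something like the following Python sketch may help. It is not part of the original post. It assumes the mount table uses the usual `device mountpoint fstype options ...` layout and that the transport appears either as a bare `tcp`/`udp` option or as `proto=...`; how a given kernel spells this in `/proc/mounts` is an assumption worth checking on your own machines. The function names are made up for illustration.

```python
def nfs_transport(options):
    """Return the transport named in a comma-separated NFS option string.

    Gives 'unknown' when no transport option is listed (assumption: the
    kernel then uses its default, which we cannot see from the options).
    """
    for opt in options.split(","):
        if opt in ("tcp", "udp"):
            return opt
        if opt.startswith("proto="):
            return opt.split("=", 1)[1]
    return "unknown"


def non_tcp_nfs_mounts(mount_table):
    """Return [(mount_point, transport)] for NFS mounts not using TCP."""
    result = []
    for line in mount_table.splitlines():
        fields = line.split()
        # Expect: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        transport = nfs_transport(fields[3])
        if not transport.startswith("tcp"):
            result.append((fields[1], transport))
    return result
```

On a client, pass it the contents of `/proc/mounts` (or `/etc/fstab`) read as a string and print whatever comes back.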
On Fri, 23 Jan 2004 08:44:25 -0500 Steve Losen scl@sasha.acc.Virginia.EDU wrote:
I have seen that error on Linux when using NFS over UDP, including NFS servers other than a Netapp. I switched to TCP for all my NFS mounts and haven't had any trouble.
After investigating, we think it might be a problem with the Broadcom NIC drivers in 2.4.21. All hosts with 2.4.21 produce heavy TX errors and collisions on all switches. We will try new kernels on a few machines now; hopefully the NFS problems will disappear once the network problems are gone.
Greetings,
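For comparing TX errors and collisions between the 2.4.18 and 2.4.21 hosts from the client side, the per-interface counters in `/proc/net/dev` can be read with a small Python sketch like this. It is not from the original post and assumes the usual layout: two header lines, then one line per interface with eight receive counters followed by eight transmit counters (bytes, packets, errs, drop, fifo, colls, carrier, compressed). The function name is hypothetical.

```python
def tx_problems(proc_net_dev):
    """Map interface name -> (tx_errs, collisions) from /proc/net/dev text."""
    stats = {}
    for line in proc_net_dev.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        # Fields 0-7 are receive counters, 8-15 are transmit counters.
        # Transmit order: bytes packets errs drop fifo colls carrier compressed
        stats[name.strip()] = (int(fields[10]), int(fields[13]))
    return stats
```

Sampling this twice a few minutes apart and comparing the numbers shows whether the counters are still climbing on a given host.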
I also found that using TCP everywhere is a good thing. On our Linux boxes we run 2.4.20-9smp (Red Hat 9 kernel patches applied).
For Linux boxes, we use the following mount options: rw,hard,intr,nfsvers=3,wsize=32768,rsize=32768,proto=tcp
On Fri, 23 Jan 2004, Steve Losen wrote:
I have seen that error on Linux when using NFS over UDP, including NFS servers other than a Netapp. I switched to TCP for all my NFS mounts
We are experiencing similar troubles, with NFS over UDP with many Linux clients, but with various other platforms also. Linux does, however, seem to be the most twitchy and least stable. We are running mostly 2.4.18 kernels.
On some of our filers, we've empirically found no such problems if the UDP transfer size is set to 8 kB instead of the default 32 kB. (The default used to be 8 kB, but changed in ONTAP...around 2 years ago?)
nfs.udp.xfersize 8192
Some of our filers that are still at a transfer size of 32k continue to have problems, and we are (slowly) migrating the clients to a transfer size of 8k via an option change in our automount map and a reboot (of the client). We tried changing the transfer size on the fly, or in conjunction with a reboot of the filer, but that resulted in many clients writing null bytes to files that were (we suspect) open across the transfer size change. Whoops...one lesson learned the hard way...
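To track which clients still need migrating, a rough Python sketch like the one below could list UDP mounts whose transfer size is above 8 kB. It is not from the original post. It assumes the client's mount table shows `rsize=`/`wsize=` in the options field and marks TCP mounts with `tcp` or `proto=tcp`; the function name and the default limit of 8192 are illustrative choices.

```python
def oversized_udp_mounts(mount_table, limit=8192):
    """Return [(mount_point, sizes)] for non-TCP NFS mounts above `limit`.

    `sizes` is a dict holding whichever of rsize/wsize the options list.
    """
    flagged = []
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "nfs":
            continue
        opts = fields[3].split(",")
        if "tcp" in opts or "proto=tcp" in opts:
            continue  # TCP mounts are not affected by the UDP transfer size
        sizes = {}
        for opt in opts:
            key, _, value = opt.partition("=")
            if key in ("rsize", "wsize") and value.isdigit():
                sizes[key] = int(value)
        if any(size > limit for size in sizes.values()):
            flagged.append((fields[1], sizes))
    return flagged
```

Mounts that list no `rsize`/`wsize` at all are not flagged, so an empty result is only meaningful if the mount table actually reports those options.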
We have been in contact with NetApp tech support, and they suspect when the reassembly queue on the filer fills up, clients experience delays that result in the typical "nfs server not responding" messages logged in the messages files of many clients. Interactively the experience is very annoying, with 5-10-30 second delays on file writes (EMACS autosaves, saving mail files, etc.). It is conjectured that reducing the UDP transfer size will not require the filer to keep as many packets in the queue for reassembly, and hence the chances of having the queue fill up will be much smaller.
Caveat: We are still experimenting with this solution, so the above results are not definitive.
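The reassembly-queue conjecture is easier to picture with some arithmetic. The Python sketch below is an illustration and not NetApp's analysis. It assumes a standard 1500-byte Ethernet MTU (no jumbo frames) and a 20-byte IPv4 header without options, and it ignores the RPC/NFS header bytes that ride along with the data. Under those assumptions it counts how many IP fragments one UDP datagram becomes.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes


def ip_fragments(udp_payload, mtu=1500):
    """Number of IP fragments needed for one UDP datagram."""
    # Data carried per fragment must be a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((udp_payload + UDP_HEADER) / per_fragment)


print(ip_fragments(32768))  # 23 fragments for a 32 kB transfer
print(ip_fragments(8192))   # 6 fragments for an 8 kB transfer
```

Every fragment of a datagram has to be held until the last one arrives, so under these assumptions each 32 kB transfer occupies roughly four times as many reassembly slots as an 8 kB one, and a single lost fragment forces the whole datagram to be resent.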
Two site-specific items that may cause your mileage to vary from ours:
o On a high-performance LAN, and with high-performance ops/sec requirements from our filers, we should not need to pay the overhead of TCP compared with UDP. ("My kingdom for a TOE!")
o Most of our data transfers are random, small files, not streaming transfers of large files, so the reduction of the UDP transfer size should have a negligible effect on our performance. This has proven true on the highest-performing filers on which we have already made the change.
Until next time...
The MathWorks, Inc.                           508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098     508-647-7001 FAX
tmerrill@mathworks.com                        http://www.mathworks.com