known problems with linux 2.4.21 kernels?


Stefan Funke

23 Jan 2004

10:50 a.m.

Hi all,

We've found some odd failure messages on all Linux machines running 2.4.21 here. Those machines report "NFS: server filerX-vif1 not responding, timed out" again and again throughout the day.

All machines are either Dell 2650 or HP ProLiant DL360/DL380. All of them have a "Broadcom Corporation NetXtreme" Gigabit NIC. Machines of the same type running Linux kernel 2.4.18 seem to have no NFS problems.

Has anyone seen the same problems, and if so, is it a kernel problem? Is 2.4.23 more stable with NFS, or should I downgrade everything to 2.4.18? Is anyone from NetApp with Linux customers here? ;-)

Greetings,

--
Stefan Funke            eMail   : bundy@arcor-ip.de
Arcor AG & Co. KG
Otto-Volger-Strasse 19  fax     : ++49-(0)6196 587-705
D-65843 Sulzbach        PGP Key : http://tbd.arcor.de/keys/bundy.asc


Steve Losen

23 Jan

1:44 p.m.

I have seen that error on Linux when using NFS over UDP, including NFS servers other than a Netapp. I switched to TCP for all my NFS mounts and haven't had any trouble. Here is my /etc/fstab entry:

filer:/vol/vol0/dir /dir nfs rw,hard,intr,tcp,bg 0 0

I'm running 2.4.20-24.7 (RH 7.3).
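
As a rough way to audit a client for mounts that have not yet been switched to TCP, the fstab can be scanned for NFS entries lacking a `tcp` or `proto=tcp` option. This is only a sketch: the function name, the second NFS entry, and the `/dev/sda1` line are made up for illustration, and it assumes that a mount without an explicit TCP option falls back to UDP, as it did on the 2.4 kernels discussed in this thread.

```python
def nfs_mounts_without_tcp(fstab_text):
    """Return (device, mount_point) pairs for NFS entries that do not request TCP."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        options = fields[3].split(",")
        if "tcp" not in options and "proto=tcp" not in options:
            flagged.append((fields[0], fields[1]))
    return flagged


# Only the first NFS entry comes from the post above; the rest is hypothetical.
SAMPLE_FSTAB = """\
# device              mount point  type  options               dump pass
filer:/vol/vol0/dir   /dir         nfs   rw,hard,intr,tcp,bg   0 0
filer:/vol/vol1/home  /home        nfs   rw,hard,intr          0 0
/dev/sda1             /            ext3  defaults              1 1
"""

print(nfs_mounts_without_tcp(SAMPLE_FSTAB))
# prints [('filer:/vol/vol1/home', '/home')]
```

On a real client the text would come from reading `/etc/fstab`; mounts made through an automounter would need their maps checked separately.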

Steve Losen scl@virginia.edu phone: 434-924-0640

University of Virginia ITC Unix Support

Stefan Funke

1:53 p.m.

On Fri, 23 Jan 2004 08:44:25 -0500 Steve Losen scl@sasha.acc.Virginia.EDU wrote:

I have seen that error on Linux when using NFS over UDP, including NFS servers other than a Netapp. I switched to TCP for all my NFS mounts and haven't had any trouble. Here is my /etc/fstab entry:

filer:/vol/vol0/dir /dir nfs rw,hard,intr,tcp,bg 0 0

I'm running 2.4.20-24.7 (RH 7.3).

After investigating, we think it might be a problem with the Broadcom NIC driver in 2.4.21. All hosts running 2.4.21 produce heavy TX errors and collisions on all switches. We will now try new kernels on a few machines; hopefully the NFS problems will disappear once the network problems are gone.
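
For anyone who wants to check the client side for the same symptoms, the transmit error and collision counters can be read from `/proc/net/dev`. The sketch below parses that file's standard column layout; the function name and all counter values in the sample are invented for illustration and are not measurements from these hosts.

```python
def tx_counters(proc_net_dev_text):
    """Map interface name -> transmit packets, transmit errors, and collisions."""
    stats = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        values = [int(v) for v in counters.split()]
        if len(values) < 16:
            continue
        # Columns 0-7 are receive counters, 8-15 are transmit counters:
        # bytes packets errs drop fifo colls carrier compressed
        stats[name.strip()] = {
            "tx_packets": values[9],
            "tx_errs": values[10],
            "colls": values[13],
        }
    return stats


# Hypothetical sample in the /proc/net/dev layout.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0: 5000000   40000    0    0    0     0          0         0  7000000   50000  120    0    0   340       0          0
"""

for iface, c in tx_counters(SAMPLE).items():
    if c["tx_errs"] or c["colls"]:
        print(iface, c)
```

On a live host the input would be the contents of `/proc/net/dev`; comparing the counters between a 2.4.18 and a 2.4.21 machine of the same type would show whether the errors follow the kernel.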

Greetings,

--
Stefan Funke            eMail   : bundy@arcor-ip.de
Arcor AG & Co. KG       phone   : ++49-(0)6196 587-775
Otto-Volger-Strasse 19  fax     : ++49-(0)6196 587-705
D-65843 Sulzbach        PGP Key : http://tbd.arcor.de/keys/bundy.asc

Antonio Varni

9:51 p.m.

I have also found that using TCP everywhere works well. On our Linux boxes we run 2.4.20-9smp (with the RedHat 9 kernel patches applied).

For Linux boxes, we use the following mount options: rw,hard,intr,nfsvers=3,wsize=32768,rsize=32768,proto=tcp

Todd C. Merrill

4 Feb

7:46 p.m.

On Fri, 23 Jan 2004, Steve Losen wrote:

...

I have seen that error on Linux when using NFS over UDP, including NFS servers other than a Netapp. I switched to TCP for all my NFS mounts

We are experiencing similar trouble with NFS over UDP on many Linux clients, and on various other platforms as well. Linux does, however, seem to be the most twitchy and least stable. We are running mostly 2.4.18 kernels.

On some of our filers, we've empirically found no such problems if the UDP transfer size is set to 8 kB instead of the default 32 kB. (The default used to be 8 kB, but changed in ONTAP...around 2 years ago?)

nfs.udp.xfersize 8192

Some of our filers that are still at a transfer size of 32k continue to have problems, and we are (slowly) migrating the clients to a transfer size of 8k via an option change in our automount map and a reboot of the client. We tried changing the transfer size on the fly, and in conjunction with a reboot of the filer, but that resulted in many clients writing null bytes to files; we suspect this hit files that were open across the transfer size change. Whoops...one lesson learned the hard way...

We have been in contact with NetApp tech support, and they suspect that when the reassembly queue on the filer fills up, clients experience delays that result in the typical "nfs server not responding" messages logged in the messages files of many clients. Interactively the experience is very annoying, with delays of 5, 10, or even 30 seconds on file writes (EMACS autosaves, saving mail files, etc.). It is conjectured that with a smaller UDP transfer size the filer will not need to keep as many packets in the queue for reassembly, and hence the chances of the queue filling up will be much smaller.
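
To put rough numbers on the reassembly argument: a UDP datagram larger than the link MTU is split into IP fragments, and the receiver must hold all of them until the last one arrives. The sketch below counts fragments per transfer, assuming a standard 1500-byte Ethernet MTU and IPv4 without options (neither is stated in this thread) and ignoring the RPC/NFS headers. The 0.1% per-fragment loss rate is purely hypothetical.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes


def fragments_per_datagram(transfer_size, mtu=1500):
    """Number of IP fragments needed for one UDP datagram carrying transfer_size bytes."""
    # Every fragment except the last carries a payload that is a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((transfer_size + UDP_HEADER) / per_fragment)


def datagram_loss(transfer_size, fragment_loss, mtu=1500):
    """Probability that at least one fragment is lost, so the whole datagram is lost."""
    n = fragments_per_datagram(transfer_size, mtu)
    return 1 - (1 - fragment_loss) ** n


print(fragments_per_datagram(32768))  # prints 23
print(fragments_per_datagram(8192))   # prints 6

# Hypothetical 0.1% per-fragment loss: the larger transfer loses datagrams more often.
print(datagram_loss(32768, 0.001) > datagram_loss(8192, 0.001))  # prints True
```

Under these assumptions a 32 kB transfer ties up roughly four times as many reassembly slots as an 8 kB one, and a single lost fragment forces the whole datagram to be retransmitted, which is consistent with the conjecture above.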

Caveat: We are still experimenting with this solution, so the above results are not definitive.

Two site-specific items that may cause your mileage to vary from ours:

o On a high-performance LAN, and given the high ops/sec we require from our filers, we should not have to pay the overhead of TCP compared with UDP. ("My kingdom for a TOE!")

o Most of our data transfers are random, small files, not streaming transfers of large files, so the reduction of the UDP transfer size should have a negligible effect on our performance. This has proven true on the highest-performing filers on which we have already made the change.

Until next time...

The MathWorks, Inc.                          508-647-7000 x7792
3 Apple Hill Drive, Natick, MA 01760-2098    508-647-7001 FAX
tmerrill@mathworks.com                       http://www.mathworks.com

toasters@lists.teaparty.net

4 comments

4 participants

tags (0)

participants (4)

Antonio Varni
Stefan Funke
Steve Losen
Todd C. Merrill