I have a brand-new F720 running 5.3.6R2. We're exporting two qtrees via NFS and CIFS. When the system is almost completely quiet, I try to tar up about 2M of stuff and it takes FIVE minutes. I also see NFS timeout errors on all NFS clients (the one I'm doing the tar from is Solaris 8, but I see the timeouts on Linux as well).
I've tried NFS v2 and v3, both TCP and UDP, and nothing seems to help. netstat doesn't seem to show packet collisions (the network is pretty much dead), and not a lot of errors (<10%). I'm seeing writes about every 10 seconds (which should mean good performance), and the CPU is mostly idle. Why the hell does a tar of 2M take so bloody long? I could understand it if I were moving a gig or so, but 2M?!
I've included important info (this was after I had changed to v2 UDP). Is there something else I should look at?
nfsstat -c
Server rpc:
TCP:
calls     badcalls  nullrecv  badlen    xdrcall
0         0         0         0         0
UDP:
calls     badcalls  nullrecv  badlen    xdrcall
1354      0         0         0         0
Server nfs:
calls     badcalls
1306      0
Server nfs V2: (1306 calls)
null      getattr   setattr   root      lookup    readlink  read
1 0%      372 28%   2 0%      0 0%      194 15%   0 0%      293 22%
wrcache   write     create    remove    rename    link      symlink
0 0%      440 34%   2 0%      1 0%      0 0%      0 0%      0 0%
mkdir     rmdir     readdir   statfs
0 0%      0 0%      1 0%      0 0%
Server nfs V3: (0 calls)
null      getattr   setattr   lookup    access    readlink  read
0 0%      0 0%      0 0%      0 0%      0 0%      0 0%      0 0%
write     create    mkdir     symlink   mknod     remove    rmdir
0 0%      0 0%      0 0%      0 0%      0 0%      0 0%      0 0%
rename    link      readdir   readdir+  fsstat    fsinfo    pathconf
0 0%      0 0%      0 0%      0 0%      0 0%      0 0%      0 0%
commit
0 0%
NFS V2 non-blocking request statistics:
null      getattr   setattr   root      lookup    readlink  read
1 100%    372 100%  2 100%    0 0%      194 100%  0 0%      141 48%
wrcache   write     create    remove    rename    link      symlink
0 0%      440 100%  1 50%     1 100%    0 0%      0 0%      0 0%
mkdir     rmdir     readdir   statfs
0 0%      0 0%      1 100%    0 0%
NFS V3 non-blocking request statistics:
null      getattr   setattr   lookup    access    readlink  read
0 0%      0 0%      0 0%      0 0%      0 0%      0 0%      0 0%
write     create    mkdir     symlink   mknod     remove    rmdir
0 0%      0 0%      0 0%      0 0%      0 0%      0 0%      0 0%
rename    link      readdir   readdir+  fsstat    fsinfo    pathconf
0 0%      0 0%      0 0%      0 0%      0 0%      0 0%      0 0%
NFS reply cache statistics:
TCP:
In progress  Delay hits  Misses  Idempotent  Non-idempotent
0            0           0       0           0
UDP:
In progress  Delay hits  Misses  Idempotent  Non-idempotent
0            5           732     7           42
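To see what the tar actually asked the filer to do, the V2 counts above can be turned back into an operation mix. A minimal Python sketch, with the counts retyped by hand from the nfsstat output; the op_mix helper is a hypothetical name, and percentages are rounded to whole numbers the way nfsstat prints them.

```python
"""Recompute the NFS V2 operation mix from the nfsstat counts above."""

# Per-operation call counts, copied by hand from "Server nfs V2: (1306 calls)".
V2_COUNTS = {
    "null": 1, "getattr": 372, "setattr": 2, "root": 0, "lookup": 194,
    "readlink": 0, "read": 293, "wrcache": 0, "write": 440, "create": 2,
    "remove": 1, "rename": 0, "link": 0, "symlink": 0, "mkdir": 0,
    "rmdir": 0, "readdir": 1, "statfs": 0,
}


def op_mix(counts):
    """Return (total calls, {op: percentage rounded to a whole number})."""
    total = sum(counts.values())
    return total, {op: round(100 * n / total) for op, n in counts.items()}


if __name__ == "__main__":
    total, pct = op_mix(V2_COUNTS)
    print(f"{total} calls")
    # Busiest operations first; skip anything that rounds to 0%.
    for op, p in sorted(pct.items(), key=lambda kv: kv[1], reverse=True):
        if p:
            print(f"{op:8s} {V2_COUNTS[op]:5d} {p:3d}%")
```

The per-operation counts add up to exactly 1306, matching the reported call total, and the mix is dominated by writes (34%), getattrs (28%), reads (22%) and lookups (15%): a small, ordinary workload.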
sysstat 1
 CPU   NFS  CIFS  HTTP    Net kB/s     Disk kB/s     Tape kB/s Cache
                          in   out   read  write   read  write   age
  2%     0     0     0     0     0      0      0      0      0    19
  3%     1     0     0     0     1      0      0      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
  3%    32     0     0     5     3      0      0      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
  6%     0     0     0     0     0     48    128      0      0    19
  2%     0     0     0    31     0      0      0      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
  3%     4     0     0     2     1      0      0      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
  2%     1     0     0    16     0      0      0      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
  3%     2     0     0     1     1      0      0      0      0    19
  5%     0     0     0     0     0     32    152      0      0    19
  2%     0     0     0     0     0      0      0      0      0    19
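Adding up the 18 one-second samples shows just how idle the filer was during the slow tar. A small Python sketch, with the rows retyped by hand from the sysstat output; the column names and the summarize helper are made up for illustration. Because each row covers one second, summing a kB/s column gives an approximate kB total for the interval.

```python
"""Summarize the 18 one-second samples from `sysstat 1` above."""

COLUMNS = ("cpu", "nfs", "cifs", "http", "net_in", "net_out",
           "disk_read", "disk_write", "tape_read", "tape_write", "cache_age")

# One tuple per second, retyped by hand (CPU in %, network/disk/tape in kB/s).
ROWS = [
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (3, 1, 0, 0, 0, 1, 0, 0, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (3, 32, 0, 0, 5, 3, 0, 0, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (6, 0, 0, 0, 0, 0, 48, 128, 0, 0, 19),
    (2, 0, 0, 0, 31, 0, 0, 0, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (3, 4, 0, 0, 2, 1, 0, 0, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (2, 1, 0, 0, 16, 0, 0, 0, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
    (3, 2, 0, 0, 1, 1, 0, 0, 0, 0, 19),
    (5, 0, 0, 0, 0, 0, 32, 152, 0, 0, 19),
    (2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19),
]


def summarize(rows):
    """Totals for the additive columns, peak CPU, and seconds with disk writes."""
    col = {name: i for i, name in enumerate(COLUMNS)}
    totals = {name: sum(r[col[name]] for r in rows)
              for name in ("nfs", "net_in", "net_out", "disk_read", "disk_write")}
    peak_cpu = max(r[col["cpu"]] for r in rows)
    write_seconds = [t for t, r in enumerate(rows) if r[col["disk_write"]] > 0]
    return totals, peak_cpu, write_seconds


if __name__ == "__main__":
    totals, peak_cpu, write_seconds = summarize(ROWS)
    print(totals)
    print("peak CPU:", peak_cpu, "%")
    print("disk writes at seconds:", write_seconds)
```

Over 18 seconds that is 40 NFS operations, about 60 kB of network traffic in total, and a CPU peak of 6%. The two disk-write bursts land 11 seconds apart, which fits the "writes about every 10 seconds" observation in the post.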
sysconfig -r
Volume vol0 (root)
RAID group 0
RAID Disk  HA.ID  HA SHELF BAY  CHAN  Used (MB/blks)  Phys (MB/blks)
---------  -----  ------------  ----  --------------  --------------
parity     1.3    1  0     3    FC:A  34500/70656000  35003/71687368
data       1.2    1  0     2    FC:A  34500/70656000  35003/71687368
data       1.0    1  0     0    FC:A  34500/70656000  35003/71687368
data       1.5    1  0     5    FC:A  34500/70656000  35003/71687368
data       1.6    1  0     6    FC:A  34500/70656000  35003/71687368
data       1.1    1  0     1    FC:A  34500/70656000  35003/71687368
Spare disks
RAID Disk  HA.ID  HA SHELF BAY  CHAN  Used (MB/blks)  Phys (MB/blks)
---------  -----  ------------  ----  --------------  --------------
spare      1.4    1  0     4    FC:A  0               35003/71687368
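The MB/blks pairs in the sysconfig -r output line up exactly if a block is a 512-byte sector and "MB" means 2^20 bytes. Under those two assumptions (labelled in the code), a short Python check converts the block counts and totals the data capacity of the RAID group, leaving out the parity and spare disks.

```python
"""Convert the MB/blks pairs from `sysconfig -r` and total the data capacity."""

BLOCK_BYTES = 512      # assumption: a "blk" is a 512-byte sector
MB = 1024 * 1024       # assumption: "MB" here means 2**20 bytes

USED_BLKS = 70656000   # "Used" blocks per disk
PHYS_BLKS = 71687368   # physical blocks per disk
DATA_DISKS = 5         # data disks 1.0, 1.1, 1.2, 1.5, 1.6; parity 1.3 and spare 1.4 excluded


def blocks_to_mb(blocks):
    """Whole megabytes (rounded down) in a count of 512-byte blocks."""
    return blocks * BLOCK_BYTES // MB


used_mb = blocks_to_mb(USED_BLKS)
phys_mb = blocks_to_mb(PHYS_BLKS)
print(used_mb, phys_mb, DATA_DISKS * used_mb)  # prints: 34500 35003 172500
```

Both conversions reproduce the MB figures in the table, and the five data disks give roughly 172,500 MB (about 168 GB) before any filesystem overhead.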
netstat -s
tcp:
  19180 packets sent
  16501 data packets (3116859 bytes)
  647 data packets (355343 bytes) retransmitted
  804 ack-only packets (114 delayed)
  0 URG only packets
  0 window probe packets
  1156 window update packets
  76 control packets
  22021 packets received
  15990 acks (for 3112800 bytes)
  187 duplicate acks
  0 acks for unsent data
  13688 packets (5178316 bytes) received in-sequence
  618 completely duplicate packets (890780 bytes)
  0 old duplicate packets
  10 packets with some dup. data (10 bytes duped)
  50 out-of-order packets (61320 bytes)
  0 packets (0 bytes) of data after window
  0 window probes
  0 window update packets
  1 packet received after close
  0 discarded for bad checksums
  0 discarded for bad header offset fields
  0 discarded because packet too short
  13 connection requests
  27 connection accepts
  3 bad connection attempts
  0 listen queue overflows
  32 connections established (including accepts)
  40 connections closed (including 1 drop)
  8 embryonic connections dropped
  15825 segments updated rtt (of 16329 attempts)
  531 retransmit timeouts
  0 connections dropped by rexmit timeout
  0 persist timeouts
  0 connections timed out in persist
  8 keepalive timeouts
  0 keepalive probes sent
  8 connections dropped by keepalive
  3594 correct ACK header predictions
  3474 correct data packet header predictions
  1758 PCB cache misses
  0 segments dropped at untrusted interface
  0 connection requests closed at filter
udp:
  3671023 datagrams received
  0 with incomplete header
  0 with bad data length field
  0 with bad checksum
  0 dropped due to no socket
  137 broadcast/multicast datagrams dropped due to no socket
  0 dropped due to full socket buffers
  3670886 delivered
  3665023 datagrams output
ip:
  7921585 total packets received
  0 bad header checksums
  0 with size smaller than minimum
  0 with size larger than maximum
  0 with data size < data length
  0 with header length < data size
  0 with data length < header length
  0 with bad options
  0 with incorrect version number
  0 packets with spoofed source address
  0 packets arrived on wrong port
  6499598 fragments received
  0 fragments dropped (dup or out of space)
  0 malformed fragments dropped
  0 overlapping fragments discarded
  3966 fragments dropped after timeout
  2272723 packets reassembled ok
  3693053 packets for this host
  790 packets for unknown/unsupported protocol
  0 packets forwarded
  867 packets not forwardable
  0 redirects sent
  3685059 packets sent from this host
  0 packets sent with fabricated ip header
  0 output packets dropped due to no bufs, etc.
  0 output packets discarded due to no route
  412593 output datagrams fragmented
  862312 fragments created
  0 datagrams that can't be fragmented
icmp:
  0 calls to icmp_error
  0 errors not generated 'cuz old message was icmp
  Output histogram:
    echo reply: 7
  0 messages with bad code fields
  0 messages < minimum length
  0 bad checksums
  0 messages with bad length
  Input histogram:
    echo reply: 2
    echo: 7
    time exceeded: 790
  7 message responses generated
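The loss-related counters in that dump are easier to judge as ratios. A rough Python sketch, with the counters copied from the output; keep in mind that these counters are cumulative since they were last reset, so they cover more than the one tar. As general background rather than anything measured here: over UDP, an NFS request or reply larger than the 1500-byte MTU travels as several IP fragments, and losing any one fragment discards the whole datagram, so the client only recovers when its RPC timer expires and it retransmits. A small fragment-loss rate can therefore show up as long stalls and client-side timeouts.

```python
"""Loss indicators from the filer's `netstat -s` counters above."""

TCP_DATA_SENT = 16501            # "data packets" sent
TCP_DATA_RETRANSMITTED = 647     # "data packets ... retransmitted"
TCP_RETRANSMIT_TIMEOUTS = 531
IP_FRAGMENTS_RECEIVED = 6499598
IP_FRAGMENTS_TIMED_OUT = 3966    # "fragments dropped after timeout"
IP_PACKETS_REASSEMBLED = 2272723


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"TCP data packets retransmitted: "
          f"{pct(TCP_DATA_RETRANSMITTED, TCP_DATA_SENT):.2f}%")
    print(f"TCP retransmit timeouts: {TCP_RETRANSMIT_TIMEOUTS}")
    print(f"IP fragments dropped after reassembly timeout: "
          f"{pct(IP_FRAGMENTS_TIMED_OUT, IP_FRAGMENTS_RECEIVED):.3f}%")
    # Rough average only: the numerator also includes fragments that were later dropped.
    print(f"fragments received per reassembled datagram: "
          f"{IP_FRAGMENTS_RECEIVED / IP_PACKETS_REASSEMBLED:.1f}")
```

Roughly 3.9% of TCP data packets were retransmitted, which is high for an otherwise quiet LAN. The fragment-timeout rate looks tiny, but each timed-out fragment costs a whole multi-fragment UDP request.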
netstat -i
Name  Mtu   Network    Address  Ipkts    Ierrs  Opkts  Oerrs  Collis  Queue
e8*   1500  none       none     0        0      0      0      0       0
e0    1500  192.168.1  myhost   7929745  5917   4      0
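The "<10%" in the post is a generous upper bound: e0's input-error counter works out to well under 0.1% of received packets. A one-line Python check, with the two numbers copied from the e0 row. As general background (not verified on this filer), an interface running full duplex does not perform collision detection, so its Collis column stays at zero even when the far end is half duplex and colliding. In that situation the full-duplex side typically registers input errors instead.

```python
"""Input error rate on e0 from the `netstat -i` row above."""

E0_IPKTS = 7929745   # Ipkts
E0_IERRS = 5917      # Ierrs


def error_rate_pct(errors, packets):
    """Input errors as a percentage of packets received."""
    return 100.0 * errors / packets


if __name__ == "__main__":
    print(f"e0 input errors: {error_rate_pct(E0_IERRS, E0_IPKTS):.3f}%")  # prints: e0 input errors: 0.075%
```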
Based on the description, there is no way to tell exactly what the problem is. First off, if you are using 10baseT, the 2M transfer should take about 2 seconds. If you are using 100baseT (full duplex), it should take about 0.2 seconds.
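Those figures are back-of-the-envelope wire-speed numbers. A minimal Python sketch reproduces them, assuming "2M" means 2 decimal megabytes and ignoring NFS, RPC, UDP/IP and Ethernet overhead (which is why it lands a little under the rounded 2 s and 0.2 s), and compares them with the five minutes reported in the original post.

```python
"""Back-of-the-envelope wire time for the 2 MB tar."""


def wire_time_seconds(megabytes, link_mbit_per_s):
    """Ideal transfer time, assuming decimal MB and zero protocol overhead."""
    bits = megabytes * 8 * 1_000_000
    return bits / (link_mbit_per_s * 1_000_000)


OBSERVED_SECONDS = 5 * 60   # "it takes FIVE minutes"

if __name__ == "__main__":
    for name, rate in (("10baseT", 10), ("100baseT", 100)):
        print(f"{name}: {wire_time_seconds(2, rate):.2f} s")
    # Effective throughput of the slow tar, in kB/s.
    print(f"observed: {2_000_000 / OBSERVED_SECONDS / 1000:.1f} kB/s")
```

This prints 1.60 s for 10baseT, 0.16 s for 100baseT, and an observed rate of about 6.7 kB/s, so even a 10 Mb/s link should be well over a hundred times faster than what the poster saw.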
Either way, five minutes is far too long. If you are using 100baseT (full duplex), I would look at the duplex settings on the filer side and then on the switch side. If the two sides are not set identically, you may have a duplex mismatch. In that case, I would simply set the filer's switch port to full duplex manually, then set the filer to full duplex manually as well:
ifconfig e0 mediatype 100tx-fd
Try the transfer again; you should see much better results. Note: autonegotiation sometimes works fine, but most of us stick to manually setting 100tx-fd on both the switch side and the filer side. Hope that helps. If not, you might try contacting NetApp customer support and opening a case.
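The check described above comes down to comparing the two ends of one cable. A hypothetical Python sketch: it assumes both ends are written in the filer's mediatype notation, with 100tx-fd meaning 100 Mb/s full duplex and, as an assumption, a name without "-fd" (such as 100tx) meaning half duplex. A real switch reports its port settings in its own syntax, which would have to be translated first.

```python
"""Hypothetical check that both ends of a link agree on speed and duplex."""
from itertools import takewhile
from typing import List, NamedTuple


class LinkSetting(NamedTuple):
    speed_mbit: int
    full_duplex: bool


def parse_mediatype(name: str) -> LinkSetting:
    """Parse names like "100tx-fd" or "10t".

    Assumption: the leading number is the speed in Mb/s and an "fd" part
    means full duplex; without it the link is treated as half duplex.
    "auto" is rejected because it does not pin down either value.
    """
    base, *rest = name.split("-")
    digits = "".join(takewhile(str.isdigit, base))
    if not digits:
        raise ValueError(f"not a fixed speed/duplex setting: {name!r}")
    return LinkSetting(int(digits), "fd" in rest)


def _duplex_word(setting: LinkSetting) -> str:
    return "full" if setting.full_duplex else "half"


def check_link(filer: str, switch_port: str) -> List[str]:
    """Return a list of mismatches between the filer and switch-port settings."""
    f, s = parse_mediatype(filer), parse_mediatype(switch_port)
    problems = []
    if f.speed_mbit != s.speed_mbit:
        problems.append(f"speed mismatch: filer {f.speed_mbit}, switch {s.speed_mbit}")
    if f.full_duplex != s.full_duplex:
        problems.append(f"duplex mismatch: filer {_duplex_word(f)}, switch {_duplex_word(s)}")
    return problems


if __name__ == "__main__":
    print(check_link("100tx-fd", "100tx-fd"))  # prints: []
    print(check_link("100tx-fd", "100tx"))     # prints: ['duplex mismatch: filer full, switch half']
```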
Mike Smith
NetApp Escalations Engineer
mikesmit@netapp.com
----- Original Message -----
From: arr@oceanwave.com
To: toasters@mathworks.com
Sent: Sunday, December 31, 2000 3:10 PM
Subject: NFS server timeouts on the clients
mikesmit> Based upon the description there is no way to tell exactly what the
mikesmit> problem is. First off if you are using 10baseT then the 2M transfer
mikesmit> will take 2 seconds. If you are using 100baseT (full duplex) then it
mikesmit> will take 0.2 seconds.
The filer is set at 100tx-fd:
slot 0: Ethernet Controller e0 MAC Address: 00:a0:98:00:9d:fd (100tx-fd-up)
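The filer's half of the duplex check can be scripted from that sysconfig line. A sketch in Python, using a regular expression fitted to the single line quoted above; the pattern, the parse_sysconfig_line name, and the reading of the status as "mediatype-linkstate" are assumptions, not a documented format. It only shows the filer's view, so the switch port still has to be checked separately.

```python
"""Pull interface name, MAC address and link status from a sysconfig line."""
import re

# Pattern fitted to the one line quoted above; other releases may format it differently.
LINE_RE = re.compile(
    r"Ethernet Controller (?P<ifname>\S+) MAC Address: (?P<mac>[0-9a-fA-F:]{17}) "
    r"\((?P<status>[^)]+)\)"
)


def parse_sysconfig_line(line):
    """Return (interface, MAC, mediatype, link state) from one sysconfig line."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    # Assumption: the status is "<mediatype>-<link state>", so split on the last "-".
    mediatype, _, link = m.group("status").rpartition("-")
    return m.group("ifname"), m.group("mac"), mediatype, link


if __name__ == "__main__":
    line = "slot 0: Ethernet Controller e0 MAC Address: 00:a0:98:00:9d:fd (100tx-fd-up)"
    print(parse_sysconfig_line(line))  # prints: ('e0', '00:a0:98:00:9d:fd', '100tx-fd', 'up')
```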
I don't have access to the switch, so that's something to check on. We'll also be moving to GigE, so if this is the issue, then that may inadvertently solve the problem.
Thanks for the pointer.
Thanks to everyone who mailed suggestions about checking for a speed/duplex mismatch. The switch was wedged at half duplex. NetApp should put this on their "troubleshooting NFS connections" page (so people like me don't post obvious questions :/).
http://now.netapp.com/NOW/knowledge/contents/TIP/TIP_8.shtml