Anybody else out there using FDDI for their NetApp and using Suns as clients?
We have been seeing some odd behavior at times.
Looking through the logs on the clients, we have "NFS server not responding" errors (for 0 sec duration) across all the clients. Those that beat more heavily on the NetApp (like one box that constitutes 70% of the traffic) have these every few minutes.
We have also had the NetApp stop serving NFS over the FDDI interface. Taking the interface down and bringing it back up has solved the problem. But what is the root of this problem? Bad NIC?
Any suggestions?
Thanks.
Alex
What's the rest of the network architecture? Are the Suns on FDDI or on the other side of a router or switch?
-dave
On Thu, 4 Sep 1997, Alexei Rodriguez wrote:
Anybody else out there using FDDI for their NetApp and using Suns as clients?
We have been seeing some odd behavior at times.
Looking through the logs on the clients, we have "NFS server not responding" errors (for 0 sec duration) across all the clients. Those that beat more heavily on the NetApp (like one box that constitutes 70% of the traffic) have these every few minutes.
We have also had the NetApp stop serving NFS over the FDDI interface. Taking the interface down and bringing it back up has solved the problem. But what is the root of this problem? Bad NIC?
Any suggestions?
Thanks.
Alex
+--- In our lifetime, Dave Pascoe dave@mathworks.com wrote:
|
| What's the rest of the network architecture?
| Are the Suns on FDDI or on the other side of a router or switch?
I suppose that info would help :)
All the Suns talk to the filer via FDDI. They are on a ring of their own (not attached to anything else). We use 3Com LinkBuilders as the "hubs".
I have been trying to correlate the "outages" with high-load periods on the filer. But on average we run about 3000 NFS ops, so the high-load period lasts quite a while :)
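One rough way to line the two up (the log path and interval below are made up, and this assumes a Solaris client) is to timestamp the client RPC counters periodically and compare them against the console messages:

  # on a busy client: log the RPC retransmit/timeout counters once a minute
  while :; do date; nfsstat -rc; sleep 60; done >> /var/tmp/nfs-rpc.log

Jumps in the retrans/timeout columns should then line up with the "not responding" messages if the filer is the one falling behind.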
Alexei
Alex,

We have our NetApp on a private FDDI ring (DAS) with Auspex, Solaris, and HP systems; the Solaris and HP systems are clients. We are using NFS v3 with our Solaris clients, but only UDP, not TCP. We have not seen any of the problems you have described. Have you confirmed that your FDDI ring is good (no damaged cables, etc.)? If you know you have good cables, I would say bad hardware.
On 04-Sep-97 Alexei Rodriguez wrote:
Anybody else out there using FDDI for their NetApp and using Suns as clients?
We have been seeing some odd behavior at times.
Looking through the logs on the clients, we have "NFS server not responding" errors (for 0 sec duration) across all the clients. Those that beat more heavily on the NetApp (like one box that constitutes 70% of the traffic) have these every few minutes.
We have also had the NetApp stop serving NFS over the FDDI interface. Taking the interface down and bringing it back up has solved the problem. But what is the root of this problem? Bad NIC?
Any suggestions?
Thanks.
Alex
---
Tom Wike, Unix System Administration
Texas Instruments, Inc.
7839 Churchill Way M/S 3984
Dallas, Texas 75251
Email: t-wike@ti.com
Phone: (972) 917-1252   Pager: (972) 598-1496
On Thu, Sep 04, 1997 at 09:59:11AM -0400, Alexei Rodriguez said:
Looking through the logs on the clients, we have "NFS server not responding" errors (for 0 sec duration) across all the clients. Those that beat more heavily on the NetApp (like one box that constitutes 70% of the traffic) have these every few minutes.
These occur using full-duplex 100BaseT and Suns as clients as well, so I don't think that FDDI is the key. I think I remember reading that NetApp had fixed this, or was working on a fix.
On Thu, 4 Sep 1997, Michael Douglass wrote:
Looking through the logs on the clients, we have "NFS server not responding" errors (for 0 sec duration) across all the clients. Those that beat more heavily on the NetApp (like one box that constitutes 70% of the traffic) have these every few minutes.
These occur using full-duplex 100BaseT and Suns as clients as well, so I don't think that FDDI is the key. I think I remember reading that NetApp had fixed this, or was working on a fix.
Yes. In more ancient times the network was generally the bottleneck. Today, server loading is more likely the culprit.
That's what I've found with our NetApps (2xF330s), which are both 100BaseT-connected to a Cisco Catalyst 5000. Previously they were also FDDI-connected (backups ran through the FDDI). Whenever we see "NFS server not responding" we immediately check 'systat' on the F330s and find that they're operating near peak, and the suspicion is (not being a NetApp designer) that response time is degraded enough that the NFS timers have timed out on the client waiting for a response.
We've gotten around this somewhat (not sure how useful it is though) by adjusting (raising) timeo in /etc/vfstab (Solaris 2.x). This lessens the number of error messages but I think that's just masking the true problem of response time.
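For example (the filer name and mount point below are made up; on Solaris 2.x, timeo is in tenths of a second), a vfstab entry with a longer timeout looks something like:

  filer:/home  -  /mnt/filer  nfs  -  yes  rw,hard,intr,timeo=30,retrans=5

That gives the filer roughly three seconds to answer each call before the client retransmits and, eventually, logs the "not responding" message.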
--
Dave Pascoe  |  mailto:dave@mathworks.com  |  Voice: 508.647.7362
KM3T         |  http://www.mathworks.com   |  FAX: 508.647.7002
PGP fingerprint: 53 AD 71 88 2F AA 45 AC D0 2E 68 91 71 77 39 AF
Whenever we see "NFS server not responding" we immediately check 'systat' on the F330s and find that they're operating near peak, and the suspicion is (not being a NetApp designer) that response time is degraded enough that the NFS timers have timed out on the client waiting for a response.
One doesn't need to be a NetApp designer to guess that one - an NFS client will time out if the server doesn't respond fast enough, regardless of whether the server is a filer or not.
(Basically, there are two levels of timeout-and-retry with NFS.
NFS runs atop ONC RPC. When ONC RPC runs atop "unreliable" transports such as UDP, it will retransmit a request if it doesn't get a reply quickly enough. It does that a small number of times, and then returns a "timed out" error to its caller. When it runs atop "reliable" transports such as TCP, it leaves retransmission up to the transport layer - but if it doesn't get a response back quickly enough, for a value of "quickly enough" larger than for the unreliable-transport retransmission, it gives up and returns a "timed out" error to its caller, on the theory that the server presumably got the request - as the transport didn't return a "connection timed out" error - but somehow didn't manage to handle it or get a reply back.
Most callers probably give up if RPC returns a "timed out" error. That's what NFS does with a soft mount. However, with a hard mount, NFS will log an "NFS server not responding" error, and make another RPC call, and if that times out, it'll make another call, until it gets one that succeeds or, if the mount was with "intr", somebody interrupts the loop with a signal.)
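To make the soft/hard distinction concrete (the server and mount-point names here are made up; timeo is in tenths of a second, retrans is the retry count per call):

  # hard,intr: logs "NFS server not responding", keeps retrying, can be interrupted with a signal
  mount -F nfs -o rw,hard,intr,timeo=11,retrans=5 filer:/home /mnt/hard

  # soft: once the retries are exhausted, the error goes back to the application instead
  mount -F nfs -o rw,soft,timeo=11,retrans=5 filer:/home /mnt/soft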
So if you peg a server, you could get "NFS server not responding". I think there may be, hiding somewhere around here, a set of guidelines for running on a filer some of the undocumented commands discussed in another thread, to figure out what the bottleneck is (main memory? NVRAM? disks? CPU?) and what needs to be done to remove it; Tech Support might have that (Beepy?).
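From the client side, one quick sanity check (assuming Solaris clients; the field names are from memory and may vary a bit by release) is to compare the badxid and timeout counters in the client RPC statistics:

  nfsstat -rc

If timeouts are high and badxid is comparably high, the server did eventually answer the retransmitted calls - it's slow, not gone. If timeouts are high and badxid is near zero, the requests or replies are more likely being lost somewhere on the network.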
On Thu, 4 Sep 1997, Alexei Rodriguez wrote:
Anybody else out there using FDDI for their NetApp and using Suns as clients?
Yep. Eight dual-attached Ultras to an F220 and an F230, also dual-attached. All are on the same subnet, talking via two Cisco 1400s. I am using NFSv2 with UDP mounts.
Looking through the logs on the clients, we have "NFS server not responding" errors (for 0 sec duration) across all the clients. Those that beat more heavily on the NetApp (like one box that constitutes 70% of the traffic) have these every few minutes.
We have also had the NetApp stop serving NFS over the FDDI interface. Taking the interface down and bringing it back up has solved the problem. But what is the root of this problem? Bad NIC?
I used to see this when I had a misconfigured news server. The nnrpds weren't finding the overview files, and NFS traffic skyrocketed as each nnrpd scanned entire newsgroups to build up overview data on the fly. NFS timeouts and failures to answer pings always occurred during very busy periods, though. The load meter on the 1400s peaked at nearly 100%, and usually hovered around 60%. Not only would the NetApps stop responding to pings, but some of the Ultras would too.
Once I fixed the overviews problem, all NFS-related problems disappeared as well.