You could have set the option that tells the filer to collect NFS client stats per host:
options nfs.per_client_stats.enable on
Then you could run "rsh filer nfsstat -h | more" and look for the offending client. Even better would be the following sequence:
rsh filer options nfs.per_client_stats.enable on
rsh filer nfsstat -z
(wait a short period of time, seconds or minutes)
rsh filer nfsstat -h | more
Now look to see which host is sending the most requests, then go to that host and hunt for the offending process(es).
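If you want that as a one-shot script, something along these lines should do it (an untested sketch; the filer name and the 60-second window are just placeholders):

  #!/bin/sh
  # enable per-client stats, zero the counters, wait, then dump
  # the per-host totals to see who is hammering the filer
  FILER=filer        # placeholder -- your filer's hostname
  rsh $FILER options nfs.per_client_stats.enable on
  rsh $FILER nfsstat -z
  sleep 60
  rsh $FILER nfsstat -h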
--tmac
John Stoffel wrote:
Hi all,
Just to drag this conversation back to a purely NetApp, purely NFS scenario, I'd like to get some help and pointers on how I can solve a problem I had this morning in a more general and useful way.
Let me give you the background details here.
We have a bunch of toasters here, various old F330s, an F520 (soon to be retired) and some F740s. This morning a bunch of people were complaining that their workstations were slow, that home directories were timing out, etc. These people all had their home directories on an F330 running OnTap 5.2.1. It has 192MB of RAM and four shelves, each with 7 x 4GB disks.
The poor system was simply pinned to the wall by a client. The CPU was hovering between 85% and 100%, and it was constantly reading and writing around 2.3MB/s to the disks. The nfsstat output told me that about 23% of the traffic was writes; the rest was attribute lookups and reads. The usual mix of NFS traffic. The cache age was down around 4-5 (it's normally much higher), so I knew it was getting hit hard with writes.
But since the system was on a direct link back to a switch, and since I don't run the network at all and don't have access to it, I couldn't tell which system(s) were beating it up.
We ended up putting in a PC on a repeater to sniff the link between the switch and the NetApp to try and figure out which host(s) was the bad boy.
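For what it's worth, the capture itself was just along these lines on the sniffer box (a sketch; I'm leaving out the interface flags):

  # show NFS traffic with raw IPs (no DNS lookups); eyeballing
  # which source address dominates points at the busy client
  tcpdump -n port 2049

  # the Solaris equivalent, using snoop's RPC filter
  snoop rpc nfs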
Once we figured that out, it still didn't help since the two hosts didn't look loaded at all, nor were there any runaway processes sucking up IO that I could find. The clients were both quad processor Suns running Solaris 2.5.1 or 2.6.
I used the following tools to try to figure out what was going on, and failed. We had to reboot the two systems to solve the problem. As a Unix admin, this really pained me, since I should have been able to find the culprits and just kill them off. We used these on the Solaris side:
snoop
tcpdump
lsof
top
ps (in all kinds of variations)
ethereal (found after the fact; will be used in the future)
And on the NetApp side I used:
nfsstat
netstat -n
netstat -r
sysstat 4
And while they all showed me something, none of them could show me what I needed.
On the NetApp side I needed something to show me the top 10 NFS hosts by IP address, but I couldn't get it to work. The output of 'netstat -r' wasn't any help at all.
On the Solaris side, tcpdump showed me the traffic, but didn't give me a way to relate it back to a specific process. And while lsof showed me processes, it didn't show me which one was writing data and at what rate.
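In hindsight, something like pairing the two might have narrowed it down (a sketch; <pid> is whichever candidate lsof turns up, and truss needs root for other users' processes):

  # list processes with NFS files open (-N selects NFS files)
  lsof -N

  # then count syscalls for a suspect process; interrupt it after
  # a while -- a runaway writer shows up as a huge write() count
  truss -c -p <pid>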
Does anyone have any hints? I'm thinking of upgrading to 5.3.6 at some point, just to bring the F330s up to date with the F740s, but I'm not really in a rush.
Ideally, something on the NetApp side to show me the top NFS clients by data rate, or anything close to it, would be a godsend. Then something on the client side to figure out which process(es) were the NFS hogs would also be good.
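Even crude per-host numbers on the client side would help rank the suspects; a rough sketch, run as root on each candidate Sun:

  # zero the client-side NFS counters, wait a minute, then see
  # how many calls this box generated in that window
  nfsstat -z
  sleep 60
  nfsstat -c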
Thanks,
John

John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
stoffel@lucent.com - http://www.lucent.com - 978-952-7548
--
****** All New Numbers!!! ******
Timothy A. McCarthy --> System Engineer, Eastern Region
Network Appliance        http://www.netapp.com
240-268-2034 Office      Page Me at: 888-971-4468
240-268-2001 Fax