Hi all,
Just to drag this conversation back to a purely NetApp, purely NFS scenario, I'd like to get some help and pointers on how I can solve a problem I had this morning in a more general and useful way.
Let me give you the background details here.
We have a bunch of toasters here, various old F330s, an F520 (soon to be retired) and some F740s. This morning a bunch of people were complaining that their workstations were slow, that home directories were timing out, etc. These people all had their home directories on an F330 running OnTap 5.2.1. It has 192 MB of RAM and four shelves, each with 7 x 4 GB disks.
The poor system was simply pinned to the wall by a client. The CPU was hovering between 85% and 100%, and it was reading and writing around 2.3 MB/s to the disks constantly. nfsstat told me that about 23% of the traffic was writes; the rest was attribute lookups and reads. The usual mix of NFS traffic. The cache age was down around 4-5 (it's normally much higher), so I knew it was getting hit hard with writes.
But since the system was on a direct link back to a switch, and since I don't run the network at all and don't have access to it, I couldn't tell which system(s) were beating it up.
We ended up putting in a PC on a repeater to sniff the link between the switch and the NetApp to try and figure out which host(s) was the bad boy.
Once we figured that out, it still didn't help since the two hosts didn't look loaded at all, nor were there any runaway processes sucking up IO that I could find. The clients were both quad processor Suns running Solaris 2.5.1 or 2.6.
I used the following tools to try and figure out what was going on here, and failed. We had to reboot the two systems to solve the problem. Now as a Unix admin, this really pained me, since I should have been able to find the culprits and just kill them off. We used these on the Solaris side:
snoop
tcpdump
lsof
top
ps (in all kinds of variations)
ethereal (found after the fact, will be used in the future)
And on the NetApp side I used:
nfsstat
netstat -n
netstat -r
sysstat 4
And while they all showed me something, none of them could show me what I needed.
On the NetApp side I needed something to show me the top 10 NFS hosts by IP address, but I couldn't get it to work. The output of 'netstat -r' was no help at all.
On the Solaris side, tcpdump showed me the traffic, but didn't give me a way to relate it back to a specific process. And while lsof showed me processes, it didn't show me which one was writing data and at what rate.
Does anyone have any hints? I'm thinking of upgrading to 5.3.6 at some point, just to bring the F330s up to date with the F740s, but I'm not in a rush really.
Ideally, something on the NetApp side to show me the top NFS clients in terms of data rate, or anything really, would be a godsend. Then something on the client side to figure out which process(es) were the NFS hogs would also be good.
Thanks,
John

John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
stoffel@lucent.com - http://www.lucent.com - 978-952-7548
You could have set the option to let the filer collect nfs client stats per host:
options nfs.per_client_stats.enable on
Then you would be able to run "rsh filer nfsstat -h | more" and look for the offending client. Even better would be the following:
rsh filer options nfs.per_client_stats.enable on
rsh filer nfsstat -z
(wait a short period of time, seconds or minutes)
rsh filer nfsstat -h | more
Now you look and see which host is sending the most requests, then go to that host and look for the offending process(es).
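Once the counters have been zeroed and allowed to accumulate for a while, picking out the busiest hosts is just a sort. The sketch below is only an illustration: it assumes you have already pulled a per-host request count out of the `nfsstat -h` output by whatever means suits your release (the parsing is deliberately not shown, since the output layout is not described in this thread). The function name `top_clients` and the host names are hypothetical.

```python
def top_clients(calls_by_host, n=10):
    """Rank hosts by NFS request count.

    calls_by_host: dict mapping host name or IP to the number of requests
    seen since the counters were last zeroed (assumed to be extracted
    from `nfsstat -h` output beforehand).
    Returns up to n tuples of (host, calls, share_of_total).
    """
    total = sum(calls_by_host.values())
    ranked = sorted(calls_by_host.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (host, calls, calls / total if total else 0.0)
        for host, calls in ranked[:n]
    ]


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    sample = {"sun1": 70, "sun2": 20, "pc7": 10}
    for host, calls, share in top_clients(sample, n=2):
        print(f"{host:10s} {calls:8d} {share:6.1%}")
```

Taking the snapshot a fixed interval after `nfsstat -z` means the counts are directly comparable as rates, so no further normalisation is needed.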
--tmac
--
Timothy A. McCarthy --> System Engineer, Eastern Region
Network Appliance http://www.netapp.com
Office: 240-268-2034
Fax: 240-268-2001
Page me at: 888-971-4468
You should be able to turn on client stats on the filer and you would get per-client stats.
options nfs.per_client_stats.enable on
should do it.
alexei
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of John Stoffel
Sent: Wednesday, November 08, 2000 12:45 PM
To: toasters@mathworks.com
Subject: Finding who is pounding your NetApp
stoffel@casc.com (John Stoffel) writes:
Just to drag this conversation back to a purely NetApp, purely NFS scenario, I'd like to get some help and pointers on how I can solve a problem I had this morning in a more general and useful way.
Let me give you the background details here.
[...]
an F330 running OnTap 5.2.1.
[...]
The poor system was simply pinned to the wall by a client.
[...]
But since the system was on a direct link back to a switch, and since I don't run the network at all and don't have access to it, I couldn't tell which system(s) were beating it up.
[...]
On the NetApp side I needed something to show me the top 10 NFS hosts by IP address, but I couldn't get it to work.
How about
options nfs.per_client_stats.enable on
and
nfsstat -l
or
nfsstat -h
? This goes back way beyond 5.2.1, I think.
Chris Thompson
University of Cambridge Computing Service,
New Museums Site, Cambridge CB2 3QG, United Kingdom.
Email: cet1@ucs.cam.ac.uk
Phone: +44 1223 334715
While I agree with the responses that one should turn options nfs.per_client_stats.enable on (the performance hit is minor), I urge NetApp to come up with something like an "nfstop" so one can quickly identify which clients are sending the most requests and which one is using the most CPU.
Bruce
John, several folks have already suggested nfsstat -h. But you already know which clients are the problem. Here is what I suggest:
On each client, use lsof to find the processes with open NFS files. This command reports NFS files held open by processes belonging to user quentin, and excludes program text (binaries) and current working directories:
lsof -N -a -u quentin | grep -vE 'txt|cwd'
For each process, stop it with 'kill -STOP <pid>'. If the load on the filer decreases, you may have found your culprit. Resume the process afterwards with 'kill -CONT <pid>'.
Did the contents of the NFS traffic offer any hints, e.g. file ownership, file contents, etc.?
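The lsof-then-STOP procedure above could be scripted roughly as follows. This is a sketch under assumptions, not a tested tool: it relies on lsof's field output mode (`-F pcfn`, which prints one field per line prefixed by `p` for PID, `c` for command, `f` for file descriptor and `n` for name) instead of grepping the normal listing, and the helper names `parse_lsof_fields`, `pause` and `resume` are made up for illustration. Unlike `grep -vE 'txt|cwd'`, it filters on the descriptor field only, so a file that merely has "txt" in its name is kept.

```python
import os
import signal

# Descriptor types to ignore: current working directory and program text.
SKIPPED_FDS = {"cwd", "txt"}


def parse_lsof_fields(text):
    """Group `lsof -N -a -u <user> -F pcfn` output by process.

    Returns {pid: {"command": str, "files": [str, ...]}}.
    """
    procs = {}
    current = None
    fd = None
    for line in text.splitlines():
        if not line:
            continue
        tag, value = line[0], line[1:]
        if tag == "p":  # start of a new process set
            current = procs.setdefault(int(value), {"command": "", "files": []})
            fd = None
        elif current is None:
            continue  # ignore anything before the first process line
        elif tag == "c":
            current["command"] = value
        elif tag == "f":
            fd = value
        elif tag == "n" and fd not in SKIPPED_FDS:
            current["files"].append(value)
    return procs


def pause(pid):
    """Suspend a suspect process (equivalent to kill -STOP)."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Let the process continue (equivalent to kill -CONT)."""
    os.kill(pid, signal.SIGCONT)
```

A typical session would feed the captured lsof output to `parse_lsof_fields`, then call `pause(pid)` on one candidate at a time while watching sysstat on the filer, and `resume(pid)` once the effect is clear.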
Here's another method. I've used it on standalone machines with local disk, but it should work anyway.
First get a copy of Adrian Cockcroft's "porsche" book, Sun Performance and Tuning, Volume 2. Go to page 189. I imagine this stuff is in a SunWorld online paper, but I don't know where.
That page details how to use 'prex' on a Solaris 2.x system to record all I/O through the system for a few seconds. You record it to a buffer, dump the buffer, then extract the info. I then whipped up a Perl program to read the extracted info and sort it by PID.
I'll usually get a nice histogram with one or two processes being I/O hogs and the rest much lower. I then know where to look. :-)
This is a general process that would work with any I/O, but it could be filtered to see only NFS or local disks, I'm sure.
I don't know of a more general solution, or one for other than Solaris machines.
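The sort-by-PID step described above might look something like this. The Perl program itself is not shown in the thread, so this is a stand-in sketch in Python, and the input format is an assumption: it presumes the trace dump has already been reduced to one "pid bytes" pair per line (hypothetical; the real dump carries more fields and would need its own extraction step). The names `io_by_pid` and `histogram` are illustrative.

```python
from collections import Counter


def io_by_pid(records):
    """Sum bytes of I/O per PID from lines of the form '<pid> <bytes>'.

    Lines that do not match that shape are skipped.
    Returns a list of (pid, total_bytes), busiest first.
    """
    totals = Counter()
    for line in records:
        parts = line.split()
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue
        pid, nbytes = parts
        totals[int(pid)] += int(nbytes)
    return totals.most_common()


def histogram(ranked, width=40):
    """Render a crude text histogram, scaled to the busiest process."""
    if not ranked:
        return []
    top = ranked[0][1] or 1
    return [
        f"{pid:>7} {'#' * max(1, round(width * total / top))} {total}"
        for pid, total in ranked
    ]


if __name__ == "__main__":
    # Hypothetical extracted records, for illustration only.
    sample = ["101 8192", "202 512", "101 8192", "303 2048"]
    print("\n".join(histogram(io_by_pid(sample))))
```

With real data the one or two I/O hogs should stand out as the long bars at the top, which is the cue for where to look next.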