Hi folks,
I'm running into a strange problem here, where my users are beating up on an F740 running 6.4.5 (just upgraded; they did the same when it was running 5.3.7RxDy) with a ton of getattr() NFSv3 calls. The load suddenly shoots up to 7,000 nfs ops/sec, the system is using 30-50% of its CPU, but it's barely touching the disks. The clients are all Solaris 5.x, mostly 5.7 or 5.8, with some 5.6 and 5.9 thrown in.
I've used the netapp-top command to find the client(s) with the most NFS operations. I'm root on all of these clients (if not, I turn off access by that client :-), so I can get on there and do what I want, but it's not easy.
I pretty much know that my users are using ClearCase and clearmake to build software, but tracking down which process/user is beating on the filesystem is making me crazy.
I've tried running ethereal (version 0.8.11) on the client system(s) to look at the packets, but all I can see is that a bunch of calls are happening, not which file(s) they point to.
Using lsof doesn't help either, since the file(s) aren't held open in any way, and I could easily miss which one they are looking at.
We've done some simple truss -a -f on a test build, but that doesn't seem to stress out the system in the same way that my users are doing things. Really frustrating.
Any suggestions or hints would be appreciated. Putting a sniffer in line with the client(s) isn't really possible since I don't have one available nor do I want to lug it all over the building to various client systems.
Thanks,
John

John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
stoffel@lucent.com - http://www.lucent.com - 978-952-7548
Are you using UDP or TCP? Are you using a mix of Gig-e and 100mb clients?
Jerry> Are you using UDP or TCP?
tcp mostly.
Jerry> Are you using a mix of Gig-e and 100mb clients?
Gig-e to the Toaster, 100mb (mostly) on the clients.
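(For anyone else who wants to double-check a client, "nfsstat -m" on Solaris lists each NFS mount with its options, so you can confirm TCP vs UDP per mount. The mount point and filer name below are just placeholders, and the exact flag list varies by OS release.)

    # show per-mount NFS options on a Solaris client
    nfsstat -m
    # output looks roughly like:
    #   /mnt/build from mytoaster:/vol/vol1
    #    Flags: vers=3,proto=tcp,sec=sys,hard,intr,rsize=32768,wsize=32768
    # the proto= field is what confirms TCP vs UDP for that mount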
I saw some scenarios where using mixed speeds caused problems unless I switched to tcp. The filer was attempting to send stuff faster than the client could accept (gig to 100mb). This would cause the client to re-request packets, and would spiral down into horrible performance on the client. I never saw the filer cpu rise, but I suppose it could be the same. Do you see poor performance as well, or just a spin up in cpu?
Nope, no performance problems on the clients from what I see. Their builds just crank along. They *seem* to be doing a bunch of parallel builds. If you're familiar with ClearCase, it lets you specify a bunch of hosts to do parallel builds on, so you can farm out a build job across multiple hosts.
I think once I can figure out which file(s) they're doing the getattr() calls against, I can then chase down the problematic change in their build setup or whatever.
Thanks for the suggestions,
John

John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
stoffel@lucent.com - http://www.lucent.com - 978-952-7548
Have you tried using snoop on your Solaris clients? The output (run with no options) includes getattr calls with a filehandle number, and "LOOKUP3" calls sometimes include a FH# and a filename. It seems that with the right options, the data you want is available.
You might try capturing packets for a few seconds into a file, and then using various options until you come across the one that works for you.
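For example, something along these lines should get you started; the interface name, filer hostname, and capture file are placeholders, so adjust for your environment:

    # capture 30 seconds or so of NFS traffic between this client and the
    # filer, then Ctrl-C to stop
    snoop -d hme0 -o /tmp/nfs.cap host mytoaster and port 2049
    # replay the capture in summary form; GETATTR3 lines carry an FH= hash, and
    # LOOKUP3 lines show the FH= hash together with the name being looked up
    snoop -i /tmp/nfs.cap | egrep 'GETATTR|LOOKUP'
    # -v gives a full decode of a single packet (here packet 100) if you need
    # the complete filehandle rather than the short hash
    snoop -i /tmp/nfs.cap -v -p 100,100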
Brian> Have you tried using snoop on your Solaris clients?
Nope, just ethereal, which seems to be a superset of snoop. I could be wrong of course.
Brian> The output (run with no options) includes getattr calls with a
Brian> filehandle number, and "LOOKUP3" calls sometimes include a FH#
Brian> and a filename. It seems that with the right options, the data
Brian> you want is available.
Ethereal gave me the filehandle, I just didn't know how to use it. So it's really the LOOKUP() call I need to find/trace when this happens. With my luck, it will be a single file that gets hit a ba-zillion times and I'll only start tracing after the first ga-zillion. *grin*
Brian> You might try capturing packets for a few seconds into a file,
Brian> and then using various options until you come across the one
Brian> that works for you.
I've tried that with ethereal, capturing a few minutes of data access, but I never found a LOOKUP call. Now that I know to look explicitly for such calls, I can make a more concerted effort.
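Roughly what I'm planning to try against a snoop capture (the /tmp/nfs.cap name is just an example, and the exact layout of the FH= field in snoop's summary output is from memory, so this may need tweaking):

    # find the filehandle hash that shows up most often in GETATTR3 calls
    snoop -i /tmp/nfs.cap | grep GETATTR3 | \
        awk '{for (i = 1; i <= NF; i++) if ($i ~ /^FH=/) print $i}' | \
        sort | uniq -c | sort -rn | head
    # pull every packet that touches the hot handle (replace 1A2B with the
    # hash found above); if a LOOKUP3 reply returned that handle, the LOOKUP3
    # call just before that reply in the full summary carries the file name
    snoop -i /tmp/nfs.cap | grep 'FH=1A2B'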
Thanks for the help.
John
How many files are in the source directories where the software builds are happening? Utilities like "make" need to compare modification times on files to decide what needs recompiling. Every .o file needs to be checked against the .c and .h files that it depends on, etc. So even if you just modify one source file, "make" still needs to check everything.
Does "clearmake" run multiple processes/threads to build things in parallel? That would drive up the load even higher, but for shorter duration, since the build would finish faster.
Steve Losen scl@virginia.edu phone: 434-924-0640
University of Virginia ITC Unix Support
Why, just a few weeks ago I noticed almost *exactly* those same circumstances after an upgrade and a reboot of an F820 (6.5.1R1). In this case, netapp-top.pl (or at least the old version I have?) was giving utterly nonsensical results (including a negative number of ops/sec?) so I just used "nfsstat -r" on the filer, followed up with "snoop" to confirm and identify the culprit hosts.
It seems that the getattr() calls were on the mount point itself, not a file beneath it, which may explain why "lsof -N" was confused. This was a case where we had migrated the root volume on the filer from an FC-9 shelf to a DS14 shelf, so vol0 was an entirely new volume. Prior to the work on the filer we had unmounted filesystems from the servers and machines we cared about, and expected "NFS stale file handle" errors on any client machines that we missed; we figured we'd just reboot those later. But on rebooting from the new vol0, the Solaris clients that were freaking out and looping like you described were the ones that we hadn't touched. Oddly enough, they *didn't* report stale file handles as we'd expected, and things appeared to be working(!) - except that something in the NFS client was causing the odd traffic.
A quick and dirty "fuser -kc /troubled/mount/point" and umount/mount cycle cleared it up. Not at all sure if this applies to your situation, but the symptoms you describe exactly match what we saw.
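In case it saves anyone a man page lookup, the whole recovery on each affected client was basically just:

    # kill anything holding the mount point busy, then remount it
    fuser -kc /troubled/mount/point
    umount /troubled/mount/point
    mount /troubled/mount/point   # assumes the mount is in vfstab; otherwise
                                  # give the full filer:/vol/path argument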
-- Chris
"Chris" == Chris Lamb skeezics@selectmetrics.com writes:
Chris> Why, just a few weeks ago I noticed almost *exactly* those same
Chris> circumstances after an upgrade and a reboot of an F820 (6.5.1R1).
Chris> In this case, netapp-top.pl (or at least the old version I have?)
Chris> was giving utterly nonsensical results (including a negative
Chris> number of ops/sec?) so I just used "nfsstat -r" on the filer,
Chris> followed up with "snoop" to confirm and identify the culprit hosts.
Yeah, the version of netapp-top I was running was also showing bad results. I ended up hacking my own limited perl script to show me the data I wanted. Maybe I'll update the netapp-top script to work better someday.
Chris> It seems that the getattr() calls were on the mount point itself,
Chris> not a file beneath it, which may explain why "lsof -N" was
Chris> confused. This was a case where we had migrated the root volume
Chris> on the filer from an FC-9 shelf to a DS14 shelf, so vol0 was an
Chris> entirely new volume. Prior to the work on the filer we had
Chris> unmounted filesystems from the servers and machines we cared
Chris> about, and expected "NFS stale file handle" errors on any client
Chris> machines that we missed; we figured we'd just reboot those later.
Chris> But on rebooting from the new vol0, the Solaris clients that were
Chris> freaking out and looping like you described were the ones that we
Chris> hadn't touched. Oddly enough, they *didn't* report stale file
Chris> handles as we'd expected, and things appeared to be working(!) -
Chris> except that something in the NFS client was causing the odd traffic.
This is interesting, but not quite what I've run into. I was having the problem when running 5.3.7..., then when we rebooted the server into 6.4.5 (nice smooth upgrade process btw) we didn't reboot any clients. And then we had the same problem again a few days later. No client reboots or anything.
I'm pretty sure it's the users doing something with parallel builds but finding out the file(s) they're poking at would be the first step in figuring out what they're doing here.
Chris> A quick and dirty "fuser -kc /troubled/mount/point" and
Chris> umount/mount cycle cleared it up. Not at all sure if this applies
Chris> to your situation, but the symptoms you describe exactly match
Chris> what we saw.
Thank you for the followup. I'll have to keep the fuser trick in mind when I see this happening again.
John