On Thu, 26 Aug 2004, John Stoffel wrote:
> I'm running into a strange problem here, where my users are beating up on an F740 running 6.4.5 (just upgraded; they did the same when it was running 5.3.7RxDy) with a ton of getattr() NFSv3 calls. The load suddenly shoots up to 7,000 NFS ops/sec and the system is using 30-50% of its CPU, but it's barely touching the disks. The clients are all Solaris 5.x, mostly 5.7 or 5.8, with some 5.6 and 5.9 thrown in.
Why, just a few weeks ago I noticed almost *exactly* those same circumstances after an upgrade and a reboot of an F820 (6.5.1R1). In this case, netapp-top.pl (or at least the old version I have?) was giving utterly nonsensical results (including a negative number of ops/sec?) so I just used "nfsstat -r" on the filer, followed up with "snoop" to confirm and identify the culprit hosts.
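For reference, the diagnostic sequence was roughly the following (a sketch, not verbatim history; "filer" and the interface name are placeholders, nfsstat runs on the filer console and snoop on any Solaris box that can see the traffic):

```shell
# On the filer console: per-client NFS reply statistics, to spot the
# top talker. Reset counters first with -z if you want a clean sample.
nfsstat -r

# On a Solaris host with visibility into the traffic: capture NFS RPC
# packets to/from the filer to confirm which client is looping on
# GETATTR. Adjust -d to your actual interface (hme0 is a placeholder).
snoop -d hme0 host filer and rpc nfs
```

The snoop capture is what actually names the culprit hosts; nfsstat just tells you something is wrong and roughly how bad.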
It seems that the getattr() calls were on the mount point itself, not on a file beneath it, which may explain why "lsof -N" was confused. This was a case where we had migrated the root volume on the filer from an FC-9 shelf to a DS14 shelf, so vol0 was an entirely new volume. Before the work we had unmounted the filesystems on the servers and machines we cared about, expecting "NFS stale file handle" errors on any client machines we missed; we figured we'd just reboot those later. Instead, after the filer came up on the new vol0, the Solaris clients that were freaking out and looping like you described were exactly the ones we hadn't touched. Oddly enough, they *didn't* report stale file handles as we'd expected, and things appeared to be working(!) - except that something in the NFS client was generating the odd traffic.
A quick and dirty "fuser -kc /troubled/mount/point" and umount/mount cycle cleared it up. Not at all sure if this applies to your situation, but the symptoms you describe exactly match what we saw.
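In case the exact incantation helps, the cleanup amounted to something like this (a sketch; note that fuser -k SIGKILLs every process holding the mount, so think twice on a busy box):

```shell
# Kill all processes using the mount point (-c: treat the argument as a
# mount point, -k: send SIGKILL to each user of it). This frees the
# handles the looping NFS client code was holding.
fuser -ck /troubled/mount/point

# Unmount and remount so the client picks up fresh file handles for the
# new vol0 instead of the pre-migration ones.
umount /troubled/mount/point
mount /troubled/mount/point
```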
-- Chris