On Mon, 4 Mar 2002, Jose Celestino wrote:
> Ahh, in the meantime I was able to get the output of filestats, it may be of some help....
> [snip]
> The getattr count seems way too high, and this may point to bad caching on the frontends. But could this bring the CPU to 100% most of the time? Could this be a WAFL issue related to the low available space on the volume?
Just out of curiosity, what's your "wafl.maxdirsize" option set to? Is there a chance you've got one directory that's reached its limit? You didn't mention what sort of directory structure your application is using, and with 9.9M files perhaps there's a directory that's full.
I suggest this because it bit us recently and produced _very_ similar symptoms to what you described. We had an application that hit the limit after putting 102,399 files in one directory, and then started looping, trying to rename a 102,400th file into it. The result was a load of 1800+ NFS ops/sec and artificially high CPU usage numbers, plus an /etc/messages file that dutifully logged the thousands upon thousands of "ENOSPC" errors, which our application patiently and persistently ignored. :-) After 24 hours our Cricket NFS ops/sec graphs looked bonkers.
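To make the failure mode concrete, here's a rough Python sketch of what such a "patient" writer loop looks like from the client side (not our actual application, just an illustration, and the paths in the example are made up). The stat()-before-retry is where a lot of the mysterious getattr traffic can come from:

import errno
import os
import time

def persistent_rename(src, dst):
    """Keep retrying a rename until it 'works' -- the pathological pattern.

    Every pass stats the target directory (an NFS getattr over the wire)
    and then retries the rename, which the filer keeps rejecting with
    ENOSPC because the directory has hit its maxdirsize limit.
    """
    while True:
        try:
            os.stat(os.path.dirname(dst))   # getattr on the already-full directory
            os.rename(src, dst)             # fails again with ENOSPC
            return
        except OSError as e:
            if e.errno == errno.ENOSPC:
                time.sleep(0.01)            # "be nice" and try again... forever
                continue
            raise

# e.g. persistent_rename("/mnt/vol0/spool/tmp.123", "/mnt/vol0/spool/full_dir/file.dat")

Multiply that by however many writer processes you have and you get exactly the kind of ops/sec and CPU numbers we saw.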
So, it might seem a little weird, but check your messages log for ENOSPC:
Wed Feb 20 16:05:22 PST [GbE-e7]: Returning ENOSPC for large dir (fsid 26082, inum 2135872)
and see if perhaps you've hit a directory size limit. An application that's trying to be "well behaved" and retry a failed creat() or rename() could be the source of all those mysterious getattrs. Upping the maxdirsize would alleviate that, as would splitting up any very full directories.
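If you want to hunt for the offending directory from a client, something like this little Python sketch can walk a mount and flag anything getting close. The 100,000-entry threshold is just a guess for illustration, since the real limit depends on your wafl.maxdirsize setting and file name lengths, and keep in mind that walking 9.9M files over NFS is itself a pile of getattrs, so run it off-hours:

import os
import sys

def find_full_dirs(root, threshold=100000):
    """Walk an NFS mount and report directories with huge entry counts."""
    for dirpath, dirnames, filenames in os.walk(root):
        count = len(dirnames) + len(filenames)
        if count >= threshold:
            print("%8d entries  %s" % (count, dirpath))

if __name__ == "__main__":
    # e.g. python find_full_dirs.py /mnt/vol0   (mount point is just an example)
    find_full_dirs(sys.argv[1])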
-- Chris
--
Chris Lamb, Unix Guy
MeasureCast, Inc.
503-241-1469 x247
skeezics@measurecast.com