I have an interesting problem with one of my F540s (maxed out on memory and disk controllers, approx. 40 disks):
It used to run 4.3.1D3. On the recommendation of my vendor (a bug in that release had bitten us), I upgraded to 5.2.1P2D4. The NetApp serves the home directories of approx. 160,000 users (largest UID approx. 161000), and we use default quotas; fewer than 500 of the users have specific quotas. This worked very well under 4.x. It still works under 5.2.1P2D4, but after about 20 hours of uptime the NetApp becomes _very_ sluggish, with 100% CPU consumed. It still serves NFS requests, but very slowly. Doing 'quota off; quota on' gets the box going again, but this is hardly ideal.
I can imagine some mechanisms that would cause this:
1) Bugs have crept into the quota software in OnTap
2) Bugs have crept into syslog. We run with the default syslog config, and we get a _lot_ of messages about exceeded quotas. The messages differ slightly:
Thu May 13 21:26:12 MES [wafl_hipri]: uid 75355 tid 15: disk quota exceeded on volume vol0
Thu May 13 21:27:31 MES [de1]: uid 75355 tid 15: disk quota exceeded on volume vol0
Thu May 13 22:26:44 MES [de0]: uid 18002 tid 16: disk quota exceeded on volume vol0
Thu May 13 22:26:44 MES [wafl_lopri]: uid 18002 tid 16: disk quota exceeded on volume vol0
(These are just excerpts; we get a _lot_ of them, as well as quite a few "Thu May 13 21:40:59 MES last message repeated 7 times" lines relating to the messages above.)
It's the [de0|de1|wafl_hipri|wafl_lopri] part that puzzles me. Can anyone explain?
I couldn't care less about users exceeding their quotas, and wouldn't mind being able to turn off the reporting. (I would still like to know about qtrees exceeding their quotas, however.)
My gut feeling is that this is the root of the problem: too many 'quota exceeded' messages somehow overloading syslogd.
3) The quota mechanism can't take the load (but then, why did it work so well in 4.x?)
Has anyone else seen the same problem? (It's being escalated through our vendor. My experience with them and NetApp is good, but I'd like to find out whether anyone else has seen anything like this.)
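
To put a number on hypothesis 2, the quota messages could be tallied per source tag and per UID from a copy of the messages file. The following is only a rough sketch in Python: the regular expressions are inferred from the four excerpts above, the function name tally_quota_messages is made up for illustration, and it assumes that a "last message repeated N times" line refers to the quota message immediately before it. In the sample, the position of the "repeated" line is likewise an assumption.

```python
import re
from collections import Counter

# Pattern inferred from the excerpts in this post, e.g.
# "Thu May 13 21:26:12 MES [wafl_hipri]: uid 75355 tid 15: disk quota exceeded on volume vol0"
QUOTA_RE = re.compile(
    r"\[(?P<source>[^\]]+)\]: uid (?P<uid>\d+) tid (?P<tid>\d+): "
    r"disk quota exceeded on volume (?P<volume>\S+)"
)
REPEAT_RE = re.compile(r"last message repeated (?P<count>\d+) times")


def tally_quota_messages(lines):
    """Count 'disk quota exceeded' messages per source tag and per UID.

    Assumption: a 'last message repeated N times' line belongs to the
    quota message directly preceding it; otherwise it is ignored.
    """
    by_source = Counter()
    by_uid = Counter()
    last = None  # (source, uid) of the most recent quota message, if any
    for line in lines:
        match = QUOTA_RE.search(line)
        if match:
            last = (match.group("source"), int(match.group("uid")))
            by_source[last[0]] += 1
            by_uid[last[1]] += 1
            continue
        repeat = REPEAT_RE.search(line)
        if repeat and last is not None:
            count = int(repeat.group("count"))
            by_source[last[0]] += count
            by_uid[last[1]] += count
        else:
            # Some unrelated message: a later "repeated" line is not ours.
            last = None
    return by_source, by_uid


if __name__ == "__main__":
    sample = [
        "Thu May 13 21:26:12 MES [wafl_hipri]: uid 75355 tid 15: disk quota exceeded on volume vol0",
        "Thu May 13 21:27:31 MES [de1]: uid 75355 tid 15: disk quota exceeded on volume vol0",
        "Thu May 13 22:26:44 MES [de0]: uid 18002 tid 16: disk quota exceeded on volume vol0",
        "Thu May 13 22:26:44 MES [wafl_lopri]: uid 18002 tid 16: disk quota exceeded on volume vol0",
    ]
    sources, uids = tally_quota_messages(sample)
    print("per source:", dict(sources))
    print("per uid:", dict(uids))
```

Run against a full day of messages, the per-UID counts would show whether a handful of users account for most of the noise, and the per-source counts would show how the load splits across the de0/de1/wafl_hipri/wafl_lopri tags.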