100% CPU using quotas under 5.x

13 May 1999


I have an interesting problem with one of my F540s (maxed out on memory
and disk controllers, approx. 40 disks). It used to run 4.3.1D3. On the
recommendation of my vendor (a bug in the release we were running had
bitten us), I upgraded to 5.2.1P2D4. The NetApp serves the home
directories of approx. 160,000 users (largest UID approx. 161000), and we
use default quotas. A few of the users (fewer than 500) have specific
quotas. This worked very well under 4.x. It still works, but after about
20 hours of uptime the NetApp becomes _very_ sluggish, with 100% CPU
consumed. It still serves NFS requests, but very slowly. Doing
'quota off; quota on' gets the box going again, but this is hardly ideal.
I can imagine some mechanisms that would cause this:
1) Bugs have crept into the quota software in OnTap
2) Bugs have crept into syslog - we run with the default syslog config,
   and we get a _lot_ of messages about exceeded quotas.
   The messages differ slightly:
   Thu May 13 21:26:12 MES [wafl_hipri]: uid 75355 tid 15: 
     disk quota exceeded on volume vol0
   Thu May 13 21:27:31 MES [de1]: uid 75355 tid 15: 
     disk quota exceeded on volume vol0
   Thu May 13 22:26:44 MES [de0]: uid 18002 tid 16: 
     disk quota exceeded on volume vol0
   Thu May 13 22:26:44 MES [wafl_lopri]: uid 18002 tid 16:
     disk quota exceeded on volume vol0
   (these are just excerpts, we get a _lot_ of them as well as quite a
   few "Thu May 13 21:40:59 MES last message repeated 7 times" related to
   the aforementioned messages)
   It's the [de0|de1|wafl_hipri|wafl_lopri] part that puzzles me a bit.
   Can anyone explain it?
   I couldn't care less about users exceeding their quotas, and wouldn't
   mind being able to turn off the reporting. (I would still like to
   know about qtrees exceeding their quotas, however.)
   My gut feeling is that this is the root of the problem: too many
   'quota exceeded' messages somehow overloading syslogd.
3) The quota mechanisms can't take the load (but then, why did it work
   so well in 4.x ?)
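
One way to put numbers on theory 2 would be to count the 'quota exceeded' messages per bracketed tag and per uid. The following is only a minimal Python sketch, assuming the messages have been copied to a host as plain text and look like the excerpts above. The function name `summarize_quota_messages` is hypothetical, and "last message repeated N times" lines are not counted, since it is not clear from the line itself which message they repeat.

```python
import re
from collections import Counter

# Matches the quota messages quoted above. \s+ also spans the line wrap
# seen in the excerpts; the exact layout of the real log is an assumption.
QUOTA_RE = re.compile(
    r"\[(?P<tag>\w+)\]:\s+uid\s+(?P<uid>\d+)\s+tid\s+(?P<tid>\d+):"
    r"\s+disk quota exceeded on volume\s+(?P<vol>\S+)"
)

def summarize_quota_messages(log_text):
    """Return (per-tag, per-uid) Counters of 'disk quota exceeded' messages."""
    by_tag = Counter()
    by_uid = Counter()
    for m in QUOTA_RE.finditer(log_text):
        by_tag[m.group("tag")] += 1       # e.g. de0, de1, wafl_hipri
        by_uid[int(m.group("uid"))] += 1  # which users trigger the messages
    return by_tag, by_uid
```

Feeding it the text of the log and printing `by_uid.most_common(10)` would show whether a handful of users account for most of the messages, and `by_tag` would show how the messages split across the bracketed tags.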
Has anyone else seen the same problem? (It's being escalated through our
vendor. My experience with them and NetApp is good, but I'd like to find
out whether anyone else has seen anything like this.)
-- 
---Ketil Kirkerud Elgethun, SOL System
