This is an F540 running 5.2.1P2D14. The filer is basically doing nothing, but every 5 seconds or so the CPU goes up to 80 or 90% and stays there for 6 or 7 seconds.
We've been having trouble with systems that have NFS mounts to this filer hanging. The only solution (thus far) has been to reboot the filer.
I'm not too worried about it because this system is due to be replaced this Sunday, but it'd be a good one for the data banks if anyone knows what causes this type of behaviour.
sysstat is below.
Thanks, Graham
sysstat 1:
 CPU   NFS  CIFS  HTTP  Net kB/s  Disk kB/s  Tape kB/s  Cache
                         in  out  read write read write   age
  1%     3     0     0    1    0     0     0    0     0     2
  1%     2     0     0    0    0     0     0    0     0     2
  2%    30     0     0   59  109     0     0    0     0     2
 81%    26    12     0   16  148     0     0    0     0     2
 90%     5    12     0    5    2     0     0    0     0     2
 90%     3    12     0    4    2     0     0    0     0     2
 92%     3     9     0    3    2   800   136    0     0     2
 89%     2    11     0    4    2    16   732    0     0     2
 90%     3    12     0   13    2     0     0    0     0     2
 90%     2    13     0    4    2     0     0    0     0     2
 13%     2     1     0    0    0     0     0    0     0     2
  1%    10     1     0    2    2     0     0    0     0     2
  1%     2     0     0    0    0     0     0    0     0     2
  1%     4     0     0    9    1     0     0    0     0     2
  1%     2     0     0    0    0     0     0    0     0     2
 80%     3    12     0    4    2     0     0    0     0     2
 90%     2    12     0    4    2     0     0    0     0     2
 92%     1    12     0    4    2   312     0    0     0     2
 93%     2    12     0    4    2    16   668    0     0     2
 90%     4    12     0    4    2     0     0    0     0     2
 CPU   NFS  CIFS  HTTP  Net kB/s  Disk kB/s  Tape kB/s  Cache
                         in  out  read write read write   age
 89%     4    11     0    4    2     0     0    0     0     2
 92%     5    10     0    4    2     0     0    0     0     2
  1%    10     0     0    2    2     0     0    0     0     2
  1%     2     0     0    0    0     0     0    0     0     2
  1%     9     0     0    1    1     0     0    0     0     2
This is an F540 running 5.2.1P2D14. The filer is basically doing nothing, but every 5 seconds or so the CPU goes up to 80 or 90% and stays there for 6 or 7 seconds.
Besides the CPU anomaly, the cache age is awfully low for a filer which is "basically doing nothing." (It's a bit hard to see if the lines are wrapped in your message; I unwrapped them below.)
The fact that there is a flurry of disk writes in the midst of this CPU activity is interesting. A CP (consistency point, signified by the writes) will use some CPU, but that should stop by the time of the last write. Most likely having the CP in the middle of these events is a coincidence.
Perhaps more interesting is that in this small sample the CPU usage coincides with some CIFS activity. The low network usage suggests this is some sort of meta operation such as lock fiddling. I would look at a larger sample and, if this correlation remains in a larger sample, I'd look at what was happening with CIFS. Possibly there is some operation which causes the filer to look at a large amount of cached data (accounting for the low cache age as well as CPU usage) without actually transferring much over the wire. It's been a long time since I looked at CIFS, though.
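If it helps, a longer capture could be taken from a client over rsh while grabbing the CIFS counters at the same time, along these lines ("filer" is just a placeholder hostname, and I'm going from memory on cifs stat, so check that it exists on 5.2):

    rsh filer sysstat 1 > /tmp/filer-sysstat.out   (interrupt it after a few minutes)
    rsh filer cifs stat

If one particular CIFS operation count jumps during the 80-90% intervals, that would be the thing to chase.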
sysstat 1:
 CPU   NFS  CIFS  HTTP  Net kB/s  Disk kB/s  Tape kB/s  Cache
                         in  out  read write read write   age
  1%     3     0     0    1    0     0     0    0     0     2
  1%     2     0     0    0    0     0     0    0     0     2
  2%    30     0     0   59  109     0     0    0     0     2
 81%    26    12     0   16  148     0     0    0     0     2
 90%     5    12     0    5    2     0     0    0     0     2
 90%     3    12     0    4    2     0     0    0     0     2
 92%     3     9     0    3    2   800   136    0     0     2
 89%     2    11     0    4    2    16   732    0     0     2
 90%     3    12     0   13    2     0     0    0     0     2
 90%     2    13     0    4    2     0     0    0     0     2
 13%     2     1     0    0    0     0     0    0     0     2
  1%    10     1     0    2    2     0     0    0     0     2
  1%     2     0     0    0    0     0     0    0     0     2
  1%     4     0     0    9    1     0     0    0     0     2
  1%     2     0     0    0    0     0     0    0     0     2
 80%     3    12     0    4    2     0     0    0     0     2
 90%     2    12     0    4    2     0     0    0     0     2
 92%     1    12     0    4    2   312     0    0     0     2
 93%     2    12     0    4    2    16   668    0     0     2
 90%     4    12     0    4    2     0     0    0     0     2
 CPU   NFS  CIFS  HTTP  Net kB/s  Disk kB/s  Tape kB/s  Cache
                         in  out  read write read write   age
 89%     4    11     0    4    2     0     0    0     0     2
 92%     5    10     0    4    2     0     0    0     0     2
  1%    10     0     0    2    2     0     0    0     0     2
  1%     2     0     0    0    0     0     0    0     0     2
  1%     9     0     0    1    1     0     0    0     0     2
-- Karl
Perhaps more interesting is that in this small sample the CPU usage coincides with some CIFS activity. The low network usage suggests this is some sort of meta operation such as lock fiddling. I would look at a larger sample and, if this correlation remains in a larger sample, I'd look at what was happening with CIFS. Possibly there is some operation which causes the filer to look at a large amount of cached data (accounting for the low cache age as well as CPU usage) without actually transferring much over the wire. It's been a long time since I looked at CIFS, though.
mmm, good point, it could be old bug 11596, only on a smaller scale; i.e. the filer doesn't appear to hang, but the CPU still takes a hit. this is a pretty serious bug that needs to be handled soon. there is a workaround, but what really needs to happen is for NetApp to fix this bug, either with something like the raid.reconstruct_speed option or with a way to ensure all directories are created with CIFS entries as well as NFS ones.
-steve
In 5.3, WAFL will no longer hang if a directory conversion takes more than 10 minutes, and directory conversion itself is faster. There is also an option, wafl.create_ucode, that when set creates all new directories in Unicode format, even when they are created from NFS.
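For example, once on 5.3 the option would be set with the usual options command, something like this ("toaster>" is just a placeholder prompt):

    toaster> options wafl.create_ucode on
    toaster> options wafl.create_ucode
    wafl.create_ucode            on

Presumably existing non-Unicode directories would still get converted the first time CIFS touches them, so the option mainly helps going forward.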
Joan Pearson
At 01:47 AM 7/23/99, you wrote:
Perhaps more interesting is that in this small sample the CPU usage coincides with some CIFS activity. The low network usage suggests this is some sort of meta operation such as lock fiddling. I would look at a larger sample and, if this correlation remains in a larger sample, I'd look at what was happening with CIFS. Possibly there is some operation which causes the filer to look at a large amount of cached data (accounting for the low cache age as well as CPU usage) without actually transferring much over the wire. It's been a long time since I looked at CIFS, though.
mmm, good point, it could be old bug 11596, only on a smaller scale; i.e. the filer doesn't appear to hang, but the CPU still takes a hit. this is a pretty serious bug that needs to be handled soon. there is a workaround, but what really needs to happen is for NetApp to fix this bug, either with something like the raid.reconstruct_speed option or with a way to ensure all directories are created with CIFS entries as well as NFS ones.
-steve
-- Cue the music, fade to black, no such thing as no payback. -PWEI
[ armijo@cs.unm.edu ]
In 5.3, WAFL will no longer hang if a directory conversion takes more than 10 minutes, and directory conversion itself is faster. There is also an option, wafl.create_ucode, that when set creates all new directories in Unicode format, even when they are created from NFS.
the toaster's CPU will still be maxed out, and mostly unresponsive. i'll have to play with the wafl.create_ucode option.
thanks,
-s
This is an F540 running 5.2.1P2D14. The filer is basically doing nothing, but every 5 seconds or so the CPU goes up to 80 or 90% and stays there for 6 or 7 seconds.
have you tried doing a ps on the filer to see what was actually sucking up the CPU? you'll probably want to do this via rsh, either piped to more or redirected to a file.
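something along these lines, say ("filer" standing in for the filer's hostname; i don't remember what arguments ps takes on the filer, so treat it as a sketch):

    rsh filer ps | more
    rsh filer ps > /tmp/filer-ps.out

the second form is handy if the output is long and you want to grep it afterwards.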
-steve