Re: CPU usage and HA

8 Apr 2014

At one point, there was a NetApp KB article describing how DOT
takes advantage of the different CPUs and where bottlenecks occur.
It appears that article is no longer public:
https://kb.netapp.com/support/index?page=content&id=3010150&locale=e...
One of the big things we learned from this is that when looking at
Kahuna, you also need to take a look at the (Kahu) value listed in
WAFL_EX. If Kahu+Kahuna add up to around 100%, that is when you have a
bottleneck in the Kahuna zone. We've run into this many times on our
FAS6240s.
The (Kahu) value covers items that cannot run simultaneously with Kahuna
items. This means that if (Kahu) is 60% and Kahuna is running at 39%,
the Kahuna zone is actually at 99%, so it's bottlenecked.
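
As a rough illustration of that arithmetic, here is a minimal sketch. The function names and the 95% cutoff standing in for "around 100%" are assumptions made for the example, not anything ONTAP provides, and the two percentages are read off sysstat -M by hand.

```python
def kahuna_zone_busy(kahuna_pct, kahu_pct):
    """Effective Kahuna-zone load: Kahuna plus the (Kahu) part of WAFL_EX."""
    return kahuna_pct + kahu_pct


def is_kahuna_bottlenecked(kahuna_pct, kahu_pct, threshold_pct=95.0):
    """True when Kahuna + (Kahu) is close to 100%.

    threshold_pct is an assumed stand-in for "around 100%".
    """
    return kahuna_zone_busy(kahuna_pct, kahu_pct) >= threshold_pct


# The example from the text: (Kahu) at 60%, Kahuna at 39%.
print(kahuna_zone_busy(39, 60))        # 99
print(is_kahuna_bottlenecked(39, 60))  # True
```

Looking at Kahuna alone (39%) would suggest plenty of room; only the sum shows the zone is full.
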
There are some bugs in DOT that can contribute to this. I'd have to go
back through some of my old information, but I can tell you they're
fixed in 8.1.4p1. However, workload can contribute immensely to this.
In my experience, CIFS is impacted the most by this, since a lot of
the CIFS operations are serial. Windows Offline folder caching appears
to put a lot of serial workload on the controller. System operations
that'll certainly impact it are things like snapmirror deswizzling,
large snapshot updates firing off at once, etc. For the most part, NFS
seems to have no issues, but CIFS latency will go through the roof. If
you're at the edge, you won't know it until you cross it and CIFS
becomes unusable.
--
Mike Garrison
On Tue, Apr 8, 2014 at 4:23 AM, Michael Bergman
michael.bergman@ericsson.com wrote:
...
Kelley Green wrote:
...
We have a similar issue but different numbers.  We will have one
controller that is showing 100% cpu busy but when doing the sysstat -M all
of the processors are showing fairly low usage.  The Kahuna task is in the
80-90 plus percent.

If you *really* have Kahuna in sysstat -M showing 80-90% you're
definitely saturated and should already have latency issues. But...
as always YMMV. Which version of ONTAP is this?

In general, over 50% [serial] Kahuna utilisation is a warning sign. It
depends on what protocol(s) the machine is doing, what the workload is,
and how latency sensitive the application generating it is. There are
workloads that drive s-Kahuna very much, and very little of anything else,
and vice versa.
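
The rule of thumb above can be sketched as a small classifier. This is only an illustration: the function name and labels are made up, the 50% figure is the warning level mentioned here, and using 80% as the "saturated" boundary is an assumption based on the 80-90% range discussed earlier. As noted, the real tolerance depends on protocol, workload and latency sensitivity.

```python
WARNING_PCT = 50.0    # "over 50% [serial] Kahuna utilisation is a warning sign"
SATURATED_PCT = 80.0  # assumed boundary, from "80-90% ... definitely saturated"


def classify_serial_kahuna(kahuna_pct):
    """Map a serial Kahuna percentage (as read from sysstat -M) to a label."""
    if kahuna_pct >= SATURATED_PCT:
        return "saturated"
    if kahuna_pct > WARNING_PCT:
        return "warning"
    return "ok"


for pct in (30, 55, 85):
    print(pct, classify_serial_kahuna(pct))
```
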
And please note:
cpu_busy and cpu_average are completely different things.
I.e. these two PCs (perf counters) in the CM show very different metrics:
system:system:cpu_busy
system:system:avg_processor_busy
That is, if we stick to 7-mode here and leave C.DOT out of it.
In a late ONTAP, say 8.1.1 with 4-8 cores, you can just ignore cpu_busy for
any headroom judgement. The formula for how it's calculated is fairly
complicated, but also rather meaningless these days. It used to be more
meaningful when the kernel was much less threaded. The difference between
8.0.x and 8.1.x is pretty big in this respect and even more stuff was broken
out of serial Kahuna in 8.2.  Those of you still on 8.0.x (or even earlier)
might be more interested in cpu_busy.
That said, if you do have an 8.1.1 or 8.2 machine with only 4 cores then it
will also have too little memory to really deal with workload very well;
things can get bad due to WAFL not getting enough buffers to do its job. One
can say that 8.1.1 and 8.2 are really for newer FAS boxes with 8 or more
cores and a lot of RAM. Only then will it do a good job for you, and then it
*really* does.
I usually now look at cpu_busy being <99% as just a sign ("proof") that
there are CPU resources (the union of kernel domain capacity and pure
cycles) which are not being used for anything at all.  But it can say 99%
pretty much forever, and you still have lots of headroom for absorbing more
workload pressure w/o affecting latency in any way.
In 8.1.x and above, serial Kahuna, as seen in sysstat -M, doesn't do that
much anymore and it's very rare that it comes even close to 50% in a 6200
machine.  It *can* happen if you have a type of workload that causes things
to be done which are still in s-Kahuna, but I've never seen such a workload
close up myself.  (Things are different in smaller Filers; the smaller 3100
and 3200 boxes are not as resilient to s-Kahuna being overloaded.)
...
In my searching I haven't found a clear answer of what is happening (other
than background tasks) and have gotten no information about how to try to
determine what is happening or how to limit whatever tasks are causing so
much CPU busy.  Granted the number is not an obvious problem since each CPU
is not high usage but it's definitely affecting what is happening with the
system.
This is probably a red herring. cpu_busy being 99% all the time should not
affect things in the way you're describing here, not as far as I can
understand. If you're on a big machine, then avg_processor_busy being low
means all is hunky dory.  I'd say that the cause of your issues as per above
is this, if anything:
...
The Kahuna task is in the 80-90 plus percent.
If this is so, then it's definitely *bad* for lots of things.
/M

Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
