"CPU" isn't a plagued reading..

It's just irrelevant.

People work VERY VERY hard (including here) to make ONTAP look like a Linux box with smoothly threading (to infinity) processes that get things done evenly across all resources.

Utopia.

It's not...no matter how hard people try.

Now...being SO visible, and SO heavily instrumented via sysstat, sysstat -M, and a multitude of other ways to "view" it, one could come to the conclusion that it _MEANS_ something...it HAS to...

But really...

[inline image]

_________________________________
Jeff Mohler
Tech Yahoo, Storage Architect, Principal
Twitter: @PrincipalYahoo
CorpIM:  Hipchat & Iris



On Thursday, April 21, 2016 12:11 PM, Michael Bergman <michael.bergman@ericsson.com> wrote:


If by "montoring utilisation" Paul means this PC:

system:system:cpu_busy

(N.B. the formula that calculates this inside the Counter Mgr changed in
8.2.1, in both 7-Mode and c-Mode)

...then yes, it includes the highest utilised *single threaded* kernel
domain. Serial domains are all except *_exempt, wafl_xcleaner (2 threads),
hostOS. That holds for a recent/modern ONTAP; don't trust this if you're
still on some old version!

The formula for calculating it is like this:

MAX(system:system:average_processor_busy,
    MAX(util_of(s-threaded domain1, s-threaded domain2, ... domain10)))

and it has been since 8.2.1, and still is in all 8.3 releases to this date.
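
To make that concrete, here's a throwaway Python sketch (mine, not
NetApp's!). The domain names below are placeholders, not the
authoritative list -- read the real set out of statit on your own
system/version:

# Hedged sketch of the cpu_busy formula above; the domain names are
# placeholders only.
SERIAL_DOMAINS = ["kahuna", "storage", "raid", "target", "cifs",
                  "cluster", "protocol", "nwk_exclusive", "nwk_legacy",
                  "dnscache"]

def cpu_busy(avg_processor_busy: float, domain_util: dict[str, float]) -> float:
    """MAX(average processor busy, busiest single-threaded domain)."""
    busiest_serial = max(domain_util.get(d, 0.0) for d in SERIAL_DOMAINS)
    return max(avg_processor_busy, busiest_serial)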



There are 10 of these s-threaded domains. You can see them in statit output.
The multi-threaded ones are not counted here, in other words, but those can
give you problems too -- not just wafl_exempt, which is where WAFL mostly
(hopefully!) executes (it's sometimes called parallel-Kahuna).

The domain named Kahuna in statit output is the only one included in the
new Node Utilisation metric, which also includes something called B2B CP
margin. s-Kahuna is the most dominant source of overload in this "domain"
area; that said, I've had systems suffer from overload of other single
threaded domains too, and multi-threaded ones as well: there have been
ONTAP bugs causing nwk_exempt to over-utilise (that was *not* pleasant and
hard to find). Under normal circumstances this would be really rare.

The formula for the new Node Utilisation metric is basically like this:

system:system:node_util =
        MAX(system:system:avg_processor_busy,
            100-system:system:b2b_cp_margin,
            MAX(single threaded domain{1,2,3,...} utilization))
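
In the same throwaway Python as above (an illustration of the formula,
not the actual Counter Mgr implementation):

def node_util(avg_processor_busy: float,
              b2b_cp_margin: float,
              serial_domain_utils: list[float]) -> float:
    """MAX(avg processor busy, consumed B2B CP margin, busiest serial domain)."""
    return max(avg_processor_busy,
               100.0 - b2b_cp_margin,
               max(serial_domain_utils))

# e.g. 55% avg CPU busy, 30% B2B CP margin left, busiest serial domain 62%:
# node_util(55.0, 30.0, [40.0, 62.0, 12.0]) -> 70.0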


The main reason for avoiding system:system:cpu_busy here is that it's been
so plagued over the years by showing the wrong (= not interesting!) thing
that misunderstandings have been abundant and the controversy around it
just never seems to end.

Anyway.
'Node Utilisation' aims to calculate, as a ballpark estimate, how much "head
room" is left until the system gets into B2B CP "too much" (not the odd one,
that's OK and most often not noticeable by the application/users).  To do
the calculation, you need to know the utilisation of the disks in the RAID
groups inside the system -- something which isn't that easy to do.  There's
no single PC in the CM (Counter Mgr) which will give you the equivalent of
what sysstat -x calls "Disk Util" -- that column shows the most utilised
drive in the whole system for each iteration. I.o.w. it can be a different
drive each iteration of sysstat (which is quite OK).

For scripting and doing things yourself, you pretty much have to extract
*all* the util counters from *all* the drives in the system and then
post-process them all. In a big system, with many hundreds of disks, this
becomes quite cumbersome.
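
The reduction step itself is trivial once you have the numbers; it's the
collection that hurts. Roughly, in the same illustrative Python (the
counter gathering is deliberately left out):

def summarize_disk_util(per_disk_util: dict[str, float]) -> dict[str, float]:
    """per_disk_util maps disk name -> util percent for one iteration."""
    vals = sorted(per_disk_util.values())
    if not vals:
        return {}
    n = len(vals)
    return {
        "max": vals[-1],                   # what sysstat -x prints as "Disk Util"
        "p95": vals[int(0.95 * (n - 1))],  # crude 95th percentile
        "mean": sum(vals) / n,
    }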

However, the utilisation of a drive is not as obvious a metric as one may
think. It seems simple; it's measured internally by the system at a 1 kHz
rate -- is there a command on the disk or not?
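
In toy form (my illustration, not the actual kernel code) the sampling
amounts to nothing more than a busy fraction:

# 1 kHz duty-cycle sampling, toy model: each tick is one yes/no answer
# to "is the disk busy right now?"
def disk_util_pct(busy_ticks: int, total_ticks: int = 1000) -> float:
    return 100.0 * busy_ticks / total_ticks

print(disk_util_pct(650))  # 650 busy ticks in a 1 s window -> 65.0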

But there is a caveat... apparently (as Paul Flores informed me recently)
the "yes" answer to the question is actually "is there data going in/out of
the disk right now?"  Meaning that if a drive is *really* *really* busy, so
d**n busy that it spends a lot of time seeking, then the util metric will
actually "lie" to you.  It will go down even though the disk is busier than
ever.  I'm thinking this probably doesn't matter much IRL, because it's
literally only slow (7.2K rpm, large) drives which could ever get into this
state -- and if your system ends up in this state you're in deep s**t anyway,
and there's no remedy except a reboot, or killing *all* the workload
generators to release the I/O pressure.
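
To see why the number drops, plug in some made-up numbers for a
seek-bound drive:

# Illustrative numbers only: the drive is occupied ~95% of the wall
# clock but actually moving data for a small fraction of that time.
seeking = 0.80    # fraction of time spent seeking, no data in flight
transfer = 0.15   # fraction of time actually transferring data

print(100 * (seeking + transfer))  # 95.0 -> the "command outstanding?" view
print(100 * transfer)              # 15.0 -> the "data moving right now?" view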

Think of it as a motorway completely jammed up: no car can move anywhere.
How do you "fix" it?  A: close *all* the entrance ramps and just wait. It
will drain out after a while.

Hope this was helpful to ppl. Sorry for the length of this text, but these
things are quite complex and I don't want to add to confusion or cause
misunderstandings more than absolutely necessary.

/M

On 2016-04-21 19:47, Flores, Paul wrote:
> Those are good if you want to know more about _why_ your utilization metric
> is high, but looking at them on their own is only part of the story. Nothing
> you see by looking at the domains is going to help you _more_ than just
> monitoring utilization, because utilization _includes_ the only domain that
> could cause you pain by being over utilized.
>
> PF
>
> From: Scott Eno <s.eno@me.com>
> Date: Thursday, April 21, 2016 at 12:13 PM
> To: Paul Flores <Paul.Flores@netapp.com>
> Cc: "NGC-steve.klise-wwt.com" <steve.klise@wwt.com>, Toasters
> <toasters@teaparty.net>

> Subject: Re: OnCommand CPU report question for 2.x OPM
>
> Don’t know if we’re allowed to attach images, but I’ll try. If you can see
> the attached image, you see that marrying OCPM -> NetApp Harvest -> Grafana
> you get a really nice breakdown of these processes.
>
> Sadly no alerting, just monitoring.
>
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters