"CPU" isn't a plagued reading..
It's just irrelevant.
People work VERY VERY hard (including here) to make ONTAP look like a Linux box with processes that thread smoothly (to infinity) to get things done evenly across all resources.
Utopia.
It's not...no matter how hard people try.
Now... being SO visible, and SO heavily reported via sysstat, sysstat -M, and a multitude of other ways to "view" it, one could come to the conclusion that it _MEANS_ something... it HAS to...
But really...
_________________________________
Jeff Mohler
Tech Yahoo, Storage Architect, Principal
(831) 454-6712
YPAC Gold Member
Twitter: @PrincipalYahoo
CorpIM: Hipchat & Iris
On Thursday, April 21, 2016 12:11 PM, Michael Bergman <michael.bergman@ericsson.com> wrote:
If by "montoring utilisation" Paul means this PC:
system:system:cpu_busy
(N.B. the formula that calculates this inside the Counter Mgr changed in 8.2.1, in both 7- and c-mode)
...then yes, it includes the highest utilised *single threaded* kernel domain. Serial domains are all except *_exempt, wafl_xcleaner (2 threads), and hostOS. That's for a recent/modern ONTAP; don't trust this if you're still on some old version!
The formula for calculating it is like this:
MAX(system:system:avg_processor_busy, MAX(util_of(s-threaded domain1, s-threaded domain2, ..., domain10)))
and it has been since 8.2.1, and still is in all 8.3 releases to this date.
There are 10 of these s-threaded domains; you can see them in statit output. In other words, the multi-threaded ones are not counted here, but those can give you problems too. Not just wafl_exempt, which is where WAFL mostly executes (hopefully!) and which is sometimes called parallel-Kahuna.
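If you want to script this yourself, here's a minimal Python sketch of that MAX() formula -- not NetApp code, just an illustration; how you fetch the counter values (statistics show, statit, an API of your choice) is up to you, and the example numbers are made up:

def cpu_busy(avg_processor_busy, serial_domain_util):
    """avg_processor_busy: system:system:avg_processor_busy, in percent.
    serial_domain_util: {domain: percent} for the ~10 single-threaded
    domains only (i.e. exclude *_exempt, wafl_xcleaner, hostOS)."""
    busiest_serial = max(serial_domain_util.values(), default=0.0)
    return max(avg_processor_busy, busiest_serial)

# Made-up numbers: the serial Kahuna domain dominates here.
print(cpu_busy(42.0, {"kahuna": 61.0, "network": 35.0, "raid": 12.0}))  # 61.0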
The domain named Kahuna in statit output is the only one included in the new Node Utilisation metric, which also includes something called B2B CP margin. s-Kahuna is the most dominant source of overload in this "domain" area; that said, I've had systems suffer from overload of other single-threaded domains too, and multi-threaded ones as well: there have been ONTAP bugs causing nwk_exempt to over-utilise (that was *not* pleasant and hard to find). Under normal circumstances this would be really rare.
The formula for the new Node Utilisation metric is basically like this:
system:system:node_util = MAX(system:system:avg_processor_busy, 100-system:system:b2b_cp_margin, MAX(single threaded domain{1,2,3,...} utilization))
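For scripting, the same thing as a tiny Python sketch, following the formula exactly as written above (again, collecting the counter values is up to you):

def node_util(avg_processor_busy, b2b_cp_margin, serial_domain_util):
    """b2b_cp_margin: system:system:b2b_cp_margin, in percent.
    serial_domain_util: {domain: percent} for the single-threaded domains."""
    busiest_serial = max(serial_domain_util.values(), default=0.0)
    return max(avg_processor_busy, 100.0 - b2b_cp_margin, busiest_serial)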
The main reason for avoiding system:system:cpu_busy here is that it's been so plagued over the years by showing the wrong (= not interesting!) thing that misunderstandings have been abundant and the controversy around it just never seems to end.
Anyway. 'Node Utilisation' aims to calculate, as a ballpark estimate, how much "head room" is left until the system gets into B2B CP "too much" (the odd one is OK and most often not noticeable by the applications/users). To do the calculation, you need to know the utilisation of the disks in the RAID groups inside the system -- something which isn't that easy to do. There's no single PC in the CM (Counter Mgr) which will give you the equivalent of what sysstat -x calls "Disk Util" -- that column shows the most utilised drive in the whole system for each iteration. I.o.w. it can be a different drive each iteration of sysstat (which is quite OK).
For scripting and doing things yourself, you pretty much have to extract *all* the util counters from *all* the drives in the system and then post-process them. In a big system with many hundreds of disks this becomes quite cumbersome.
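To give an idea of what that post-processing looks like, a rough Python sketch. It assumes you've already dumped the per-disk utilisation samples to a CSV with columns timestamp,disk,util_pct (the file layout is invented for illustration); it just picks the busiest drive per interval, which is roughly what the sysstat -x "Disk Util" column shows:

import csv
from collections import defaultdict

def busiest_disk_per_interval(path):
    worst = defaultdict(lambda: ("", 0.0))   # timestamp -> (disk, util_pct)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            util = float(row["util_pct"])
            if util > worst[row["timestamp"]][1]:
                worst[row["timestamp"]] = (row["disk"], util)
    return dict(worst)

for ts, (disk, util) in sorted(busiest_disk_per_interval("disk_util.csv").items()):
    print(ts, disk, "%.1f%%" % util)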
However, the utilisation of a drive is not as obvious a metric as one may think. It seems simple: it's measured internally by the system at a 1 kHz rate -- is there a command on the disk or not?
But there is a caveat... apparently (as Paul Flores informed me recently) the "yes" answer to the Q is actually "is there data going in/out of the disk right now?" Meaning that if a drive is *really* *really* busy, so d**n busy that it spends a lot of time seeking, then the util metric will actually "lie" to you. It will go down even though the disk is busier than ever. I'm thinking that this probably doesn't matter much IRL, because it's literally only slow (7.2K rpm, large) drives which could ever get into this state -- and if your system ends up in this state you're in deep s**t anyway and there's no remedy except a reboot, or killing *all* the workload generators to release the I/O pressure.
Think of it as a motorway completely jammed up: no car can move anywhere. How do you "fix" it? A: close *all* the entrance ramps, and just wait. It will drain out after a while.
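To make the 1 kHz sampling above a bit more concrete, a toy Python illustration (all numbers invented): the reported utilisation is just the fraction of samples where data was moving to/from the drive, which is also why a drive stuck seeking can report *lower* utilisation:

import random

random.seed(0)
SAMPLES_PER_SEC = 1000          # the 1 kHz sampling rate
busy_fraction = 0.70            # pretend data is moving ~70% of this second

samples = [random.random() < busy_fraction for _ in range(SAMPLES_PER_SEC)]
util_pct = 100.0 * sum(samples) / len(samples)
print("disk util ~ %.1f%%" % util_pct)
# A drive that spends its time seeking isn't transferring data, so those
# samples count as "no" and the reported utilisation drops even though
# the disk is busier than ever.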
Hope this was helpful to ppl. Sorry for the length of this text, but these things are quite complex and I don't want to add to confusion or cause misunderstandings more than absolutely necessary.
/M
On 2016-04-21 19:47, Flores, Paul wrote:
Those are good if you want to know more about _why_ your utilization metric is high, but looking at them on their own is only part of the story. Nothing you see by looking at the domains is going to help you _more_ than just monitoring utilization, because utilization _includes_ the only domain that could cause you pain by being over utilized.
PF
From: Scott Eno <s.eno@me.com>
Date: Thursday, April 21, 2016 at 12:13 PM
To: Paul Flores <Paul.Flores@netapp.com>
Cc: "NGC-steve.klise-wwt.com" <steve.klise@wwt.com>, Toasters <toasters@teaparty.net>
Subject: Re: OnCommand CPU report question for 2.x OPM
Don’t know if we’re allowed to attach images, but I’ll try. If you can see the attached image, you'll see that marrying OCPM -> NetApp Harvest -> Grafana gives you a really nice breakdown of these processes.
Sadly no alerting, just monitoring.
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters