"CPU" isn't a plagued reading..
It's just irrelevant.
People work VERY VERY hard (including here) to make ONTAP look like a Linux box with processes that thread smoothly (to infinity) to get things done evenly across all resources.
Utopia.
It's not...no matter how hard people try.
Now... being SO visible, and SO heavily reported via sysstat, sysstat -M, and a multitude of other ways to "view" it, one could come to the conclusion that it _MEANS_ something... it HAS to...
But really...
_________________________________
Jeff Mohler
Tech Yahoo, Storage Architect, Principal
(831) 454-6712
YPAC Gold Member
Twitter: @PrincipalYahoo
CorpIM: Hipchat & Iris
On Thursday, April 21, 2016 12:11 PM, Michael Bergman <michael.bergman@ericsson.com> wrote:
If by "montoring utilisation" Paul means this PC:
system:system:cpu_busy
(N.B. the formula that calculates this inside the Counter Mgr changed in 8.2.1, in both 7- and c-mode)
...then yes, it includes the highest utilised *single threaded* kernel domain. Serial domains are all except *_exempt, wafl_xcleaner (2 threads), and hostOS. That's for a recent/modern ONTAP; don't trust this if you're still on some old version!
The formula for calculating it is like this:
MAX(system:system:avg_processor_busy, MAX(util_of(s-threaded domain1, s-threaded domain2, ..., domain10)))
and it has been since 8.2.1, and still is in all 8.3 releases to this date.
There are 10 of these s-threaded domains; you can see them in statit output. In other words, the multi-threaded ones are not counted here, but those can give you problems too. Not just wafl_exempt, which is where WAFL mostly executes (hopefully!) and which is sometimes called parallel-Kahuna.
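If you want to script this yourself, here's a minimal Python sketch of that MAX() formula -- not NetApp code, just an illustration; how you fetch the counter values (statistics show, statit, an API of your choice) is up to you, and the example numbers are made up:

def cpu_busy(avg_processor_busy, serial_domain_util):
    """avg_processor_busy: system:system:avg_processor_busy, in percent.
    serial_domain_util: {domain: percent} for the ~10 single-threaded
    domains only (i.e. exclude *_exempt, wafl_xcleaner, hostOS)."""
    busiest_serial = max(serial_domain_util.values(), default=0.0)
    return max(avg_processor_busy, busiest_serial)

# Made-up numbers: the serial Kahuna domain dominates here.
print(cpu_busy(42.0, {"kahuna": 61.0, "network": 35.0, "raid": 12.0}))  # 61.0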
The domain named Kahuna in statit output is the only one included in the new Node Utilisation metric, which also includes something called B2B CP margin. s-Kahuna is the most dominant source of overload in this "domain" area; that said, I've had systems suffer from overload of other single-threaded domains too, and multi-threaded ones as well: there have been ONTAP bugs causing nwk_exempt to over-utilise (that was *not* pleasant and hard to find). Under normal circumstances this would be really rare.
The formula for the new Node Utilisation metric is basically like this:
system:system:node_util = MAX(system:system:avg_processor_busy, 100-system:system:b2b_cp_margin, MAX(single threaded domain{1,2,3,...} utilization))
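For scripting, the same thing as a tiny Python sketch, following the formula exactly as written above (again, collecting the counter values is up to you):

def node_util(avg_processor_busy, b2b_cp_margin, serial_domain_util):
    """b2b_cp_margin: system:system:b2b_cp_margin, in percent.
    serial_domain_util: {domain: percent} for the single-threaded domains."""
    busiest_serial = max(serial_domain_util.values(), default=0.0)
    return max(avg_processor_busy, 100.0 - b2b_cp_margin, busiest_serial)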
The main reason for avoiding system:system:cpu_busy here is that it's been so plagued over the years by showing the wrong (= not interesting!) thing that misunderstandings have been abundant and the controversy around it just never seems to end.
Anyway. 'Node Utilisation' aims to calculate, as a ballpark estimate, how much "head room" is left until the system gets into B2B CP "too much" (the odd one is OK and most often not noticeable by the applications/users). To do the calculation, you need to know the utilisation of the disks in the RAID groups inside the system -- something which isn't that easy to do. There's no single PC in the CM (Counter Mgr) which will give you the equivalent of what sysstat -x calls "Disk Util" -- that column shows the most utilised drive in the whole system for each iteration. I.o.w. it can be a different drive each iteration of sysstat (which is quite OK).
For scripting and doing things yourself, you pretty much have to extract *all* the util counters from *all* the drives in the system and then post-process them. In a big system with many hundreds of disks this becomes quite cumbersome.
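To give an idea of what that post-processing looks like, a rough Python sketch. It assumes you've already dumped the per-disk utilisation samples to a CSV with columns timestamp,disk,util_pct (the file layout is invented for illustration); it just picks the busiest drive per interval, which is roughly what the sysstat -x "Disk Util" column shows:

import csv
from collections import defaultdict

def busiest_disk_per_interval(path):
    worst = defaultdict(lambda: ("", 0.0))   # timestamp -> (disk, util_pct)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            util = float(row["util_pct"])
            if util > worst[row["timestamp"]][1]:
                worst[row["timestamp"]] = (row["disk"], util)
    return dict(worst)

for ts, (disk, util) in sorted(busiest_disk_per_interval("disk_util.csv").items()):
    print(ts, disk, "%.1f%%" % util)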
However, the utilisation of a drive is not as obvious a metric as one may think. It seems simple: it's measured internally by the system at a 1 kHz rate -- is there a command on the disk or not?
But there is a caveat... apparently (as Paul Flores informed me recently) the "yes" answer to the Q is actually "is there data going in/out of the disk right now?" Meaning that if a drive is *really* *really* busy, so d**n busy that it spends a lot of time seeking, then the util metric will actually "lie" to you. It will go down even though the disk is busier than ever. I'm thinking that this probably doesn't matter much IRL, because it's literally only slow (7.2K rpm, large) drives which could ever get into this state -- and if your system ends up in this state you're in deep s**t anyway and there's no remedy except a reboot, or killing *all* the workload generators to release the I/O pressure.
Think of it as a motorway completely jammed up: no car can move anywhere. How do you "fix" it? A: close *all* the entrance ramps, and just wait. It will drain out after a while.
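To make the 1 kHz sampling above a bit more concrete, a toy Python illustration (all numbers invented): the reported utilisation is just the fraction of samples where data was moving to/from the drive, which is also why a drive stuck seeking can report *lower* utilisation:

import random

random.seed(0)
SAMPLES_PER_SEC = 1000          # the 1 kHz sampling rate
busy_fraction = 0.70            # pretend data is moving ~70% of this second

samples = [random.random() < busy_fraction for _ in range(SAMPLES_PER_SEC)]
util_pct = 100.0 * sum(samples) / len(samples)
print("disk util ~ %.1f%%" % util_pct)
# A drive that spends its time seeking isn't transferring data, so those
# samples count as "no" and the reported utilisation drops even though
# the disk is busier than ever.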
Hope this was helpful to ppl. Sorry for the length of this text, but these things are quite complex and I don't want to add to confusion or cause misunderstandings more than absolutely necessary.
/M
On 2016-04-21 19:47, Flores, Paul wrote:
Those are good if you want to know more about _why_ your utilization metric is high, but looking at them on their own is only part of the story. Nothing you see by looking at the domains is going to help you _more_ than just monitoring utilization, because utilization _includes_ the only domain that could cause you pain by being over utilized.
PF
From: Scott Eno <s.eno@me.com>
Date: Thursday, April 21, 2016 at 12:13 PM
To: Paul Flores <Paul.Flores@netapp.com>
Cc: "NGC-steve.klise-wwt.com" <steve.klise@wwt.com>, Toasters <toasters@teaparty.net>
Subject: Re: OnCommand CPU report question for 2.x OPM
Don’t know if we’re allowed to attach images, but I’ll try. If you can see the attached image, you'll see that marrying OCPM -> NetApp Harvest -> Grafana gives you a really nice breakdown of these processes.
Sadly no alerting, just monitoring.
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters