Re: OnCommand CPU report question for 2.x OPM

22 Apr 2016

Thanks, I forgot about the b2b CP measure as well. :)
The reasons that the node utilization metric works off of Kahuna and not
off of Kahu (the other serialized domain) are varied and subject to
another long-winded discussion, but if you look at how things are handled
in bento, you will see that Kahuna affects the ENTIRE system's ability to
do work, and will override Kahu (parallelized serial work down in the
lower affinities).
As more user workload migrates into the lower affinities (volume/aggr and
the like), Kahuna usage becomes less of a potential workload bottleneck.
However, the possibility exists, if we get some kind of bug that drops
things into that processing domain, that it can and will pre-empt
everything going on beneath it.
So, for example, in 8.2, since we are still doing some CIFS things in
Kahuna, it's possible for a small CIFS workload to pre-empt a busier NFS
workload, due to the amount of serial processing being demanded. It would
NOT be possible for that busier NFS 'serialized' workload to cause CIFS
meta-data type stuff happening in Kahuna to slow down, since Kahuna and
Kahu are mutually exclusive execution-wise, and Kahuna has priority over
Kahu.
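To make the priority argument concrete, here is a toy sketch (not how
ONTAP actually schedules work). It assumes a hypothetical measurement
interval in which Kahuna and Kahu can never run at the same time and
Kahuna is always served first; the function name and the millisecond
figures are made up for illustration.

```python
def split_serial_time(kahuna_demand_ms, kahu_demand_ms, interval_ms):
    """Toy model: two mutually exclusive domains, Kahuna served first.

    Returns (kahuna_run_ms, kahu_run_ms) for one interval. Both the
    strict-priority rule and the units are assumptions for illustration.
    """
    # Kahuna takes whatever it asks for, up to the whole interval.
    kahuna_run = min(kahuna_demand_ms, interval_ms)
    # Kahu only gets what is left over after Kahuna has run.
    kahu_run = min(kahu_demand_ms, interval_ms - kahuna_run)
    return kahuna_run, kahu_run


# Hypothetical numbers: a small CIFS load in Kahuna, a busier NFS load in Kahu.
kahuna_run, kahu_run = split_serial_time(300, 900, 1000)
print(kahuna_run, kahu_run)  # 300 700
```

In this toy model the smaller Kahuna demand is fully served while the
larger Kahu demand is cut short, and no amount of Kahu demand changes
what Kahuna gets.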
Going to 8.3, this is not so much of a problem, but the fact remains: the
architecture of the software is such that Kahuna is a high-priority
workload domain, so anything dropping into it has the potential to disrupt
work going on in other parts of the system, i.e. it represents a potential
bottleneck to performance, and is an important thing to track when you
want to represent Node Utilization.
A lot of people are listening to how folks want an 'easy' button for
headroom. It's going to continue to get better, as ONTAP gets a handle on
spreading the user workload more evenly across more cores, IMHO.
PF
On 4/21/16, 3:27 PM, "toasters-bounces@teaparty.net on behalf of Michael
Bergman" <toasters-bounces@teaparty.net on behalf of
michael.bergman@ericsson.com> wrote:
...
Ok so two things (comments).

I believe Paul meant the new metric 'Node Utilisation' in his reply.
N.B. there's no PC (performance counter) in the CM (Counter Manager) or
anything like that for it; it's only inside OCPM.
Since it's actually currently defined like this (I *think*):
system:system:node_util =
   MAX(system:system:avg_processor_busy,  # Normalized to 85
   100-system:system:b2b_cp_margin,
   <Kahuna utilisation>)                  # Normalized to 50
what Paul wrote makes sense:
...
[...] because utilization _includes_ the only domain that could cause
you pain by being over utilized.
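
As a rough sketch of the notation above, under two assumptions that the
thread does not confirm: 'Normalized to N' is taken to mean that a raw
value of N percent maps to 100 (capped at 100), and all three inputs are
percentages. The function and parameter names are hypothetical; they only
mirror the notation.

```python
def normalize(value_pct, full_scale_pct):
    """Rescale so that full_scale_pct maps to 100, capped at 100.

    This is an ASSUMED reading of 'Normalized to N' in the formula.
    """
    return min(100.0, value_pct * 100.0 / full_scale_pct)


def node_util(avg_processor_busy, b2b_cp_margin, kahuna_util):
    """Largest of the three components, each on a 0-100 scale."""
    return max(
        normalize(avg_processor_busy, 85.0),  # "Normalized to 85"
        100.0 - b2b_cp_margin,                # less CP margin -> higher util
        normalize(kahuna_util, 50.0),         # "Normalized to 50"
    )


# Hypothetical sample values, not measurements from a real system.
print(node_util(avg_processor_busy=42.5, b2b_cp_margin=80.0, kahuna_util=30.0))
```

With these made-up inputs the Kahuna term is the largest of the three,
which is the situation Paul's comment describes.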

Pls note!  There's no Performance Counter in the CM called either
system:system:b2b_cp_margin
or
system:system:node_util.
It's just a notation I used to make it clear and stringent. I think there
probably *should* be such PCs, in the future!
My general view is that Kahuna isn't the only serial domain that can
cause you pain by being over-utilised. It's rare that any of the other 9
bottlenecks a system, but it can happen (and has).  And, as I wrote
before, you can get hurt by over-utilised multi-threaded domains too.
Again, it's not that common, though personally I think it would make a
lot of sense to include at least a few of those domains in the overall
formula for 'Node Util' as well.  R&D efforts are ongoing, I'm sure :-)
The main argument for Kahuna being so dominant in causing trouble is
heavy CIFS workload: SMB operations which have to be serialised, and are
done a lot... :-(
That said:  my very humble opinion is that since ONTAP 8.2.1
system:system:cpu_busy actually isn't that bad at all. If you know what
it shows and how it's calculated, it tells you something about the
utilisation of one or another of the 10 serial domains inside the system.
Point being: it may not be Kahuna (even if it most often is).  I've
watched our systems for long periods of time, looking at the difference
between these two in parallel:
system:system:cpu_busy
system:system:avg_processor_busy
while at the same time running sysstat -M.  Conclusion: it's not at all
always Kahuna that makes the former go up now and then.  It's been a bit
of a mystery at times, as I've had trouble matching things up so that I
can tell which of the 10 single-threaded domains is causing cpu_busy to
increase during some measurement intervals.  I need to do more with this.
The data shown by sysstat -M is in the CM as PCs as well, so it's better
to use 'stats show' in the node shell to look at it.
Hope this helps,
/M
On 2016-04-21 21:11, Michael Bergman wrote:
...
If by "monitoring utilisation" Paul means this PC:
system:system:cpu_busy
(N.B. the formula that calculates this inside the Counter Mgr changed in
8.2.1 both 7- & c-mode)
...then yes, it includes the highest utilised *single threaded* kernel
domain. Serial domains are all except *_exempt, wafl_xcleaner (2 threads),
hostOS. For a recent/modern ONTAP that is, don't trust this if you're
still on some old version!
The formula for calculating it is like this:
MAX(system:system:avg_processor_busy,
    MAX(util_of(s-threaded domain1, s-threaded domain2,... domain10)))
and it has been since 8.2.1 and still is in all 8.3 releases to date.
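
A minimal sketch of that formula, assuming the utilisation figures for
the single-threaded domains are already available as percentages (for
instance read off sysstat -M). The function name and the sample numbers
are hypothetical.

```python
def cpu_busy(avg_processor_busy, single_threaded_domain_util):
    """MAX of the average processor busy and the busiest serial domain.

    single_threaded_domain_util: utilisation percentages, one entry per
    single-threaded domain (10 of them according to this thread).
    """
    return max(avg_processor_busy, max(single_threaded_domain_util))


# Hypothetical samples: one serial domain is hotter than the CPU average...
print(cpu_busy(38.0, [12.0, 57.0, 5.0]))  # 57.0
# ...and the CPU average is higher than every serial domain.
print(cpu_busy(64.0, [12.0, 57.0, 5.0]))  # 64.0
```

The first case is the one where cpu_busy and avg_processor_busy diverge,
and the gap says only that some serial domain is the maximum, not which
one.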
[...]

Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
