Thanks, I forgot about the b2b CP measure as well. :)
The reasons that the node utilization metric works off of Kahuna and not off of Kahu (the other serialized domain) are varied and subject to another long-winded discussion. But if you look at how things are handled in bento, you will see that Kahuna affects the ENTIRE system's ability to do work, and will override Kahu (parallelized serial work down in the lower affinities).
As more user workload migrates into the lower affinities (volume/aggr and the like), Kahuna usage becomes less of a potential workload bottleneck. However, the possibility exists, if we get some kind of bug that drops things into that processing domain, that it can and will pre-empt everything going on beneath it.
So, for example, in 8.2, since we are still doing some CIFS things in Kahuna, it's possible for a small CIFS workload to pre-empt a busier NFS workload, due to the amount of serial processing being demanded. It would NOT be possible for that busier NFS 'serialized' workload to cause CIFS meta-data type stuff happening in Kahuna to slow down, since Kahuna and Kahu are mutually exclusive execution-wise, and Kahuna has priority over Kahu.
Going to 8.3, this is not so much of a problem, but the fact remains: the architecture of the software is such that Kahuna is a high-priority workload domain, so anything dropping into it has the potential to disrupt work going on in other parts of the system, i.e. it represents a potential bottleneck to performance, and is an important thing to track when you want to represent Node Utilization.
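To make the pre-emption argument concrete, here is a toy model in Python. It is not ONTAP's actual scheduler, only a sketch of the two properties described above: Kahuna and Kahu never run at the same time, and Kahuna is served first. The function name `serve_serial_domains` and the "percent of one interval" units are assumptions made for illustration.

```python
def serve_serial_domains(kahuna_demand, kahu_demand, capacity=100.0):
    """Toy model (hypothetical): split one interval of serial time.

    All values are percentages of the interval. Kahuna and Kahu are
    mutually exclusive, and Kahuna has priority over Kahu.
    """
    # Kahuna is served first, up to the whole interval.
    kahuna_served = min(kahuna_demand, capacity)
    # Kahu only gets whatever serial time Kahuna left over.
    kahu_served = min(kahu_demand, capacity - kahuna_served)
    return kahuna_served, kahu_served


# A small Kahuna workload next to a much busier Kahu workload:
# Kahuna gets everything it asked for, Kahu is cut short.
print(serve_serial_domains(kahuna_demand=30.0, kahu_demand=90.0))
```

In this model the busier Kahu workload can never reduce what Kahuna receives, while even a small Kahuna demand takes time away from Kahu, which is the asymmetry the CIFS vs. NFS example describes.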
A lot of people are listening to how folks want an 'easy' button for headroom. It's going to continue to get better as ONTAP gets a handle on spreading the user workload more evenly across more cores, IMHO.
PF
On 4/21/16, 3:27 PM, "toasters-bounces@teaparty.net on behalf of Michael Bergman" <toasters-bounces@teaparty.net on behalf of michael.bergman@ericsson.com> wrote:
Ok so two things (comments).
I believe Paul meant the new metric 'Node Utilisation' in his reply. N.B. there's no PC in the CM or anything like that for it; it's only inside OCPM.
Since it's actually currently defined like this (I *think*):
system:system:node_util =
    MAX(system:system:avg_processor_busy,    # Normalized to 85
        100-system:system:b2b_cp_margin,
        <Kahuna utilisation>)                # Normalized to 50
what Paul wrote makes sense:
[...] because utilization _includes_ the only domain that could cause you pain by being over utilized.
Pls note! There's no Performance Counter in the CM called system:system:b2b_cp_margin or system:system:node_util.
It's just a notation I used to make it clear and stringent. I think there probably *should* be such PCs, in the future!
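As a rough illustration of the notation above, here is a small Python sketch. It is not taken from OCPM. In particular, it ASSUMES that "normalized to N" means a raw value of N% is scaled so that it counts as 100%; the function name, parameter names and that scaling are all hypothetical.

```python
def node_util(avg_processor_busy, b2b_cp_margin, kahuna_util,
              cpu_ceiling=85.0, kahuna_ceiling=50.0):
    """Sketch of the node_util notation. All inputs are percentages.

    ASSUMPTION: "normalized to N" is read as "N% raw counts as 100%".
    """
    cpu_term = avg_processor_busy / cpu_ceiling * 100.0
    cp_term = 100.0 - b2b_cp_margin
    kahuna_term = kahuna_util / kahuna_ceiling * 100.0
    # The node is as utilised as its most constrained resource.
    return max(cpu_term, cp_term, kahuna_term)


# Moderate CPU, plenty of back-to-back CP margin, but a busy Kahuna:
# under the assumed scaling, the Kahuna term dominates the result.
print(node_util(avg_processor_busy=42.5, b2b_cp_margin=80.0, kahuna_util=37.5))
```

With the made-up numbers in the example, average CPU and CP margin both look comfortable, yet the result is driven by Kahuna, which is the behaviour Paul's remark refers to.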
My general view is that Kahuna isn't the only serial domain that can cause you pain by being over utilised. It's rare that any of the other 9 bottlenecks a system, but it can happen (and has happened). And, as I wrote before, you can get hurt by over utilised multi-threaded domains too. Again, it's not that common, though personally I think that it would make a lot of sense to include at least a few of those domains in the overall formula for 'Node Util' as well. R&D effort is ongoing, I'm sure :-)
The main argument about Kahuna being so dominant in causing trouble is heavy CIFS workload: SMB operations which have to be serialised, and are done a lot... :-(
That said: my very humble opinion is that since ONTAP 8.2.1 system:system:cpu_busy actually isn't that bad at all. If you know what it shows and how it's calculated, it tells you stuff about the utilisation of one or other of the 10 serial domains inside the system. Point being: it may not be Kahuna (even if it most often is). I've watched our systems for long periods of time, looking at the difference between these two in parallel:
system:system:cpu_busy
system:system:avg_processor_busy
while at the same time running sysstat -M. Conclusion: it's not at all always Kahuna that makes the former go up now and then. It's been a bit of a mystery at times, as I've had trouble matching it together so that I can tell which of the 10 single threaded domains is causing cpu_busy to increase during some measurement intervals. I need to do more with this. The data shown by sysstat -M is in the CM as PCs as well, so it's better to use 'stats show' in the node shell to look at it.
Hope this helps,
/M
On 2016-04-21 21:11, Michael Bergman wrote:
If by "monitoring utilisation" Paul means this PC:
system:system:cpu_busy
(N.B. the formula that calculates this inside the Counter Mgr changed in 8.2.1 both 7- & c-mode)
...then yes, it includes the highest utilised *single threaded* kernel domain. Serial domains are all except *_exempt, wafl_xcleaner (2 threads), hostOS. That is for a recent/modern ONTAP; don't trust this if you're still on some old version!
The formula for calculating it is like this:
MAX(system:system:avg_processor_busy,
    MAX(util_of(s-threaded domain1, s-threaded domain2,... domain10)))
and it has been since 8.2.1 and still is in all 8.3 rels to this date. [...]
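The formula above can be sketched in Python as follows. This is only an illustration of the MAX-of-MAX structure, not the Counter Manager's code; the function name, the dict layout and the sample domain labels other than Kahuna are hypothetical placeholders. The sketch also returns which serial domain (if any) drove the result, since that is the piece that is hard to read off by eye.

```python
def cpu_busy(avg_processor_busy, serial_domain_util):
    """Sketch: MAX(avg_processor_busy, MAX(serial domain utilisations)).

    serial_domain_util maps a domain name to its utilisation in percent
    and is assumed to be non-empty. Returns (value, driving_domain);
    driving_domain is None when avg_processor_busy is the larger term.
    """
    busiest_domain = max(serial_domain_util, key=serial_domain_util.get)
    busiest_util = serial_domain_util[busiest_domain]
    if busiest_util > avg_processor_busy:
        return busiest_util, busiest_domain
    return avg_processor_busy, None


# Hypothetical sample: "domain2" and "domain3" are placeholder names.
sample = {"kahuna": 22.0, "domain2": 61.0, "domain3": 5.0}
print(cpu_busy(avg_processor_busy=35.0, serial_domain_util=sample))
```

In the sample, cpu_busy ends up well above avg_processor_busy, and the domain responsible is not Kahuna.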
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters