At one point, there was a NetApp KB article talking about how DOT takes advantage of the different CPUs and talking about bottlenecks. It appears they no longer have that article as public - https://kb.netapp.com/support/index?page=content&id=3010150&locale=e...
One of the big things we learned from this is that when looking at Kahuna, you also need to take a look at the (Kahu) value listed in WAFL_EX. If Kahu+Kahuna add up to around 100%, that is when you have a bottleneck in the Kahuna zone. We've run into this many times on our FAS6240s.
The (Kahu) value covers items that cannot run simultaneously with Kahuna items. This means that if (Kahu) is 60% and Kahuna is running at 39%, the Kahuna zone is actually at 99%, so it's bottlenecked.
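To make the arithmetic concrete, here is a minimal Python sketch of that check. The function names and the 95% cutoff for "around 100%" are my own assumptions for illustration, not anything from NetApp; the two inputs are percentages you would read off the sysstat -M output by hand.

```python
def kahuna_zone_utilization(kahuna_pct: float, kahu_pct: float) -> float:
    """Effective Kahuna-zone utilization: Kahuna plus the (Kahu) work
    that cannot run at the same time as Kahuna."""
    return kahuna_pct + kahu_pct


def is_kahuna_bottlenecked(kahuna_pct: float, kahu_pct: float,
                           threshold_pct: float = 95.0) -> bool:
    """True when the combined value is 'around 100%'.

    threshold_pct is an assumed cutoff, not a documented NetApp value.
    """
    return kahuna_zone_utilization(kahuna_pct, kahu_pct) >= threshold_pct


# The example from the text: (Kahu) at 60% and Kahuna at 39%.
print(kahuna_zone_utilization(39, 60))   # 99
print(is_kahuna_bottlenecked(39, 60))    # True
```

Looking at Kahuna alone (39%) would suggest plenty of headroom; it is only the sum that shows the zone is full.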
There are some bugs in DOT that can contribute to this. I'd have to go back through some of my old information, but I can tell you they're fixed in 8.1.4p1. However, workload can contribute immensely to this. In my experience, CIFS is impacted the most, since a lot of the CIFS operations are serial. Windows Offline folder caching appears to put a lot of serial workload on the controller. System operations that'll certainly impact it are things like snapmirror deswizzling, large snapshot updates firing off at once, etc. For the most part, NFS seems to have no issues, but CIFS latency will go through the roof, and if you're at the edge, you won't know it until you cross it and CIFS becomes unusable.
-- Mike Garrison
On Tue, Apr 8, 2014 at 4:23 AM, Michael Bergman michael.bergman@ericsson.com wrote:
Kelley Green wrote:
We have a similar issue but different numbers. We will have one controller that is showing 100% cpu busy but when doing the sysstat -M all of the processors are showing fairly low usage. The Kahuna task is in the 80-90 plus percent.
- If you *really* have Kahuna in sysstat -M showing 80-90% you're definitely saturated and should already have latency issues. But... as always YMMV. Which version of ONTAP is this?
In general, over 50% [serial] Kahuna utilisation is a warning sign. It depends on what protocol(s) the machine is doing, what the workload is, and how latency sensitive the application generating it is. There are workloads that drive s-Kahuna very much, and very little of anything else, and vice versa.
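As a rough illustration of those rules of thumb, a small Python sketch follows. The function name and labels are made up for this example; the 50% warning level comes from the paragraph above, and the 80% saturation level is an assumption based on the "80-90%" remark earlier. Neither is a hard limit, since it depends on protocol and workload.

```python
def classify_serial_kahuna(kahuna_pct: float,
                           warn_pct: float = 50.0,
                           saturated_pct: float = 80.0) -> str:
    """Label a serial Kahuna percentage as seen in sysstat -M.

    warn_pct and saturated_pct are rule-of-thumb values taken from this
    thread, not documented limits.
    """
    if kahuna_pct >= saturated_pct:
        return "saturated"
    if kahuna_pct > warn_pct:
        return "warning"
    return "ok"


for sample in (30, 65, 85):
    print(sample, classify_serial_kahuna(sample))
```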
And please note:
cpu_busy and cpu_average are completely different things. I.e. these two PCs (perf counters) in the CM show very different metrics:
system:system:cpu_busy
system:system:avg_processor_busy
Let's stick to 7-mode here and leave C.DOT out of it.
In a late ONTAP, say 8.1.1 with 4-8 cores, you can just ignore cpu_busy for any headroom judgement. The formula for how it's calculated is fairly complicated, but also rather meaningless these days. It used to be more meaningful when the kernel was much less threaded. The difference between 8.0.x and 8.1.x is pretty big in this respect, and even more stuff was broken out of serial Kahuna in 8.2. Those of you still on 8.0.x (or even earlier) might be more interested in cpu_busy.
That said, if you do have an 8.1.1 or 8.2 machine with only 4 cores, then it will also have too little memory to really deal with workload very well; things can get bad due to WAFL not getting enough buffers to do its job. One can say that 8.1.1 and 8.2 are really for newer FAS boxes with 8 or more cores and a lot of RAM. Only then will it do a good job for you, and then it *really* does.
I usually now look at cpu_busy being <99% as just a sign ("proof") that there are CPU resources (the union of kernel domain capacity and pure cycles) which are not being used for anything at all. But it can say 99% pretty much forever, and you still have lots of headroom for absorbing more workload pressure w/o affecting latency in any way. In 8.1.x and above, serial Kahuna, as seen in sysstat -M, doesn't do that much anymore, and it's very rare that it comes even close to 50% in a 6200 machine. It *can* happen if you have a type of workload that causes things to be done which are still in s-Kahuna; I've never seen such a workload close up myself. (Things are different in smaller Filers; the 3100 and 3200 boxes are not as resilient to s-Kahuna being overloaded.)
In my searching I haven't found a clear answer of what is happening (other than background tasks) and have gotten no information about how to try to determine what is happening or how to limit whatever tasks are causing so much CPU busy. Granted the number is not an obvious problem since each CPU is not high usage but it's definitely affecting what is happening with the system.
This is probably a red herring. cpu_busy being 99% all the time should not affect things in the way you're describing here. Not that I could understand, anyway. If you're on a big machine, then avg_processor_busy being low means all is hunky dory. I'd say that the cause of your issues as per above is this, if anything:
The Kahuna task is in the 80-90 plus percent.
If this is so, then it's definitely *bad* for lots of things.
/M
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters