On Tue, Apr 8, 2014 at 1:04 PM, Michael Bergman michael.bergman@ericsson.com wrote:
Michael Garrison wrote:
At one point, there was a NetApp KB article talking about how DOT takes advantage of the different CPUs and talking about bottlenecks. It appears they no longer have that article as public -
Because the information in it is no longer valid and would lead people in the wrong direction, supposedly. I can understand this: keeping such a document up-to-date with everything that's happened between 8.0.x, 8.1.x, and then 8.2 w.r.t. parallelisation of things in the kernel would be a daunting task at best.
Ah, okay, that's interesting. When I learned about this we were on 8.1.x (still are), so I was under the impression it applied to 8.1.x. That it doesn't is new knowledge for me.
One of the big things we learned from this is that when looking at Kahuna, you also need to look at the (Kahu) value listed under WAFL_EX. If Kahu + Kahuna add up to around 100%, you have a bottleneck in the Kahuna zone. We've run into this many times on our FAS6240s.
This was valid for some (older) ONTAP release; I can't really tell which one. What release was on your 6240s when you saw this kind of saturation? I don't think that what sysstat -M tells you is accurate enough to let you understand a bottleneck as described above, even in 8.0.x (it *might* be) -- definitely not in 8.1.x, where the parallelisation of things is very different (waffinity changed *a lot* between 8.0 and 8.1).
8.1.2, then 8.1.3p2. We're at 8.1.4p1 now, and we still run into this problem. I certainly agree that it's not just sysstat -M that tells us this. We also look at a bunch of other stats -- wafltop, statit, wafl_susp, wafl scan status, etc. -- and analyze the patterns from the times we've had performance problems and Kahuna+Kahu at 99-100%.
The Kahu value covers items that CANNOT run simultaneously with Kahuna items. This means that if (Kahu) is at 60% and Kahuna is running at 39%, the Kahuna zone is effectively at 99% -- so it's bottlenecked.
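As a rough illustration of that arithmetic: the function names, the 99% threshold, and the sample layout below are mine, not from any NetApp tool -- just a sketch of the heuristic, assuming (as described above) that Kahu and Kahuna work truly cannot overlap on the release in question.

```python
# Sketch of the Kahu+Kahuna heuristic from the thread. Assumption: work
# counted under (Kahu) cannot run concurrently with Kahuna work, so their
# sum approximates the load on the serial Kahuna zone. The names and the
# 99% threshold are illustrative, not part of sysstat -M itself.

def kahuna_zone_load(kahuna_pct, kahu_pct):
    """Effective utilisation of the serial Kahuna zone."""
    return kahuna_pct + kahu_pct

def saturated_samples(samples, threshold=99.0):
    """Pick out (timestamp, load) pairs where the zone nears 100%."""
    return [(t, kahuna_zone_load(kahuna, kahu))
            for t, kahuna, kahu in samples
            if kahuna_zone_load(kahuna, kahu) >= threshold]

# Hypothetical samples pre-parsed from periodic sysstat -M captures
# (the raw sysstat -M layout differs between ONTAP releases, so the
# parsing step is omitted): (timestamp, Kahuna %, Kahu %)
samples = [
    ("12:00", 35.0, 40.0),   # 75% combined: headroom left
    ("12:05", 39.0, 60.0),   # the 99% example described above
    ("12:10", 45.0, 55.0),   # 100% combined: fully saturated
]

print(saturated_samples(samples))
# -> [('12:05', 99.0), ('12:10', 100.0)]
```

Scanning a log of such samples for times when the sum sits at 99-100%, and lining those up against complaint windows, is essentially the pattern analysis described above.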
True for 8.0 *iff* the Kahu value in sysstat -M takes into account the parallelism of parallel-Kahuna (up to 5, I think it was) -- I don't know. Not accurate for 8.1.x, not even close.
I would love to learn more about it not being accurate for 8.1.x, if you can point me to something! I'm basing my information on things that were explained to me and discussed with a NetApp performance engineer on site when we were having problems. Since a lot of these are deep details most people don't care about, it's hard to fully understand them without access to internal NetApp knowledge.
There are some bugs in DOT that can contribute to this; I'd have to go back through some of my old notes, but I can tell you they're fixed in 8.1.4p1. However, workload can contribute immensely to this. In my experience, CIFS is impacted the most, since a lot of the CIFS operations are serial.
Absolutely, CIFS is very much more "serial" than NFS. I'm lucky where I am to have a very NFS-dominant workload; CIFS is more or less residual, so we never have any issues.
Things that'll certainly impact it are snapmirror deswizzling, large snapshot updates firing off at once, etc. For the most part, NFS seems to have no issues, but CIFS latency will go through the roof, and if you're at the edge, you won't know it until you cross it and CIFS becomes unusable.
The problem with lots of snapshots being fired off isn't the taking of the snapshots per se -- that's practically gratis w.r.t. resources. It's the deletion of snapshots: everyone has a schedule and it has to roll... A really expensive operation inside ONTAP, as is any deletion of files, really. A weakness, quite simply, one can say. Usually with NFS, pre-8.1 (when parallelism got much better), the SETATTR op would always stand out as the slowest d*** thing in the whole machine, and when snapshot deletes were running... ouch. The underlying reason SETATTR is so slow, AFAIU, is that it goes through serialised parts (s-Kahuna) because it messes with the WAFL buffer cache, and keeping the integrity of that cache is so critical that serialisation is a necessity (losing control of the integrity of the WAFL buffer cache = panic and halt; it's always been that way).
I know there were some optimizations to the scanners in 8.1.4p1. We just recently upgraded to 8.1.4p1 and are also working on migrating to CDot. I haven't had time to go back through the stats to see whether there's been a noticeable improvement since the upgrade, but it's something I hope to do if I get free time.
-- Mike Garrison