Here's a solution.



Test your HA environments. Record the results. Do it again in 6 or 12 months, and record the results.


Plan accordingly.  All Y'all are gonna have _different results_.
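
A minimal Python sketch of what "record the results" could look like, assuming you time each takeover and giveback by hand. The record fields, the system name, and the numbers are all made up for illustration; none of this comes from a NetApp tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FailoverTest:
    """One manually recorded HA test run (hypothetical fields)."""
    run_date: date
    system: str
    takeover_secs: float
    giveback_secs: float


def compare(old: FailoverTest, new: FailoverTest) -> dict:
    """Return the change in seconds between two runs on the same system."""
    if old.system != new.system:
        raise ValueError("runs are for different systems")
    return {
        "takeover_delta": new.takeover_secs - old.takeover_secs,
        "giveback_delta": new.giveback_secs - old.giveback_secs,
    }


# Made-up example: the same pair tested six months apart.
first = FailoverTest(date(2014, 4, 8), "filer-a", 42.0, 65.0)
second = FailoverTest(date(2014, 10, 8), "filer-a", 58.0, 61.0)
print(compare(first, second))
```

Keeping the runs as plain records makes drift between tests visible instead of assumed.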



Or, trust that HAVING an HA environment means it works the way people assume it does, in your current data vacuum.



Telcos do it. Why doesn't IT?



On Tue, Apr 8, 2014 at 10:04 AM, Michael Bergman <michael.bergman@ericsson.com> wrote:
Michael Garrison wrote:
At one point, there was a NetApp KB article talking about how DOT
takes advantage of the different CPUs and talking about bottlenecks.
It appears they no longer have that article as public -

Because the information in it is no longer valid and would, supposedly, lead people in the wrong direction. I can understand this: keeping such a document up to date with everything that changed between 8.0.x, 8.1.x, and then 8.2 w.r.t. parallelisation of things in the kernel would be a daunting task at best.


One of the big things we learned from this is that when looking at
Kahuna, you also need to take a look at the (Kahu) value listed in
WAFL_EX. If Kahu+Kahuna add up to around 100%, that is when you have a
bottleneck in the Kahuna zone. We've run into this many times on our
FAS6240s.

This was valid for some (older) ONTAP release; I can't really tell which one. What release was on your 6240s when you saw this kind of saturation?
I don't think what sysstat -M tells you is accurate enough to let you understand a bottleneck as described above, even in 8.0.x (it *might* be), and definitely not in 8.1.x (big difference in the parallelisation of things; waffinity changed *a lot* between 8.0 and 8.1).


The Kahu value covers items that CANNOT run simultaneously with Kahuna
items. This means that if (Kahu) is 60% and Kahuna is running at 39%,
the Kahuna zone is actually at 99%, so it's bottlenecked.

True for 8.0 *iff* the Kahu value in sysstat -M takes into account the parallelism (up to 5 I think it was) of parallel-Kahuna. I don't know.
Not accurate for 8.1.x, not even close
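
The arithmetic from the quoted paragraph, as a small Python sketch. The function names are hypothetical, the 95% threshold is an assumption standing in for "around 100%", and the two percentages are typed in by hand from sysstat -M rather than parsed. Given the caveats above, treat the result as meaningful on 8.0.x at most, and not on 8.1.x.

```python
def kahuna_zone_busy(kahuna_pct: float, kahu_pct: float) -> float:
    """Sum Kahuna and the (Kahu) share of WAFL_EX, per the quoted rule of thumb."""
    return kahuna_pct + kahu_pct


def looks_bottlenecked(kahuna_pct: float, kahu_pct: float,
                       threshold: float = 95.0) -> bool:
    """True when the combined figure is near 100% (threshold is an assumption)."""
    return kahuna_zone_busy(kahuna_pct, kahu_pct) >= threshold


# The example from the quoted text: (Kahu) at 60%, Kahuna at 39%.
print(kahuna_zone_busy(39.0, 60.0))    # 99.0
print(looks_bottlenecked(39.0, 60.0))  # True
```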


There are some bugs in DOT that can contribute to this, I'd have to go
back through some of my old information but I can tell you they're
fixed in 8.1.4p1. However, workload can contribute immensely to this.
In my experience, CIFS is impacted the most by this, since a lot of
the CIFS operations are serial.

Absolutely, CIFS is very much more "serial" than NFS. I'm lucky to have a very NFS-dominant workload where I am; CIFS is more or less residual, so we never have any issues.


Things that'll certainly impact it include snapmirror deswizzling,
large snapshot updates firing off at once, etc. For the most part, NFS
seems to have no issues, but CIFS latency will go through the roof and
if you're at the edge, you won't know it until you cross it and CIFS
becomes unusable.

The problem with lots of snapshots being fired off isn't the taking of the snapshots per se, as that's literally gratis w.r.t. resources. It's the deletion of snapshots: everyone has a schedule and it has to roll... Deletion is a really expensive operation inside ONTAP, as is any deletion of files really. Quite simply a weakness, one can say. Usually with NFS, pre-8.1 (before parallelism got much better), the SETATTR op would always stand out as the slowest d*** thing in the whole machine, and when snapshot deletes were running... ouch.
The underlying reason for SETATTR being so slow is, AFAIU, that it goes through serialised parts (s-Kahuna) because it touches the WAFL buffer cache, and keeping the integrity of that cache is so critical that serialisation is a necessity (losing control of the integrity of the WAFL buffer cache = panic and halt; it's always been that way).


/M



--
---
Gustatus Similis Pullus