On Tue, Apr 8, 2014 at 1:04 PM, Michael Bergman michael.bergman@ericsson.com wrote:
Michael Garrison wrote:
At one point, there was a NetApp KB article talking about how DOT takes advantage of the different CPUs and talking about bottlenecks. It appears they no longer have that article as public -
Because the information in it is no longer valid and would lead people in the wrong direction, supposedly. I can understand this: keeping such a document up-to-date with everything that's happened between 8.0.x, 8.1.x, and then 8.2 w.r.t. parallelisation of things in the kernel would be a daunting task at best.
Ah, okay, that's interesting. When I learned about this we were on 8.1.x (still are), so I was under the impression it applied to 8.1.x. That it doesn't is new knowledge for me.
One of the big things we learned from this is that when looking at Kahuna, you also need to look at the (Kahu) value listed under WAFL_EX. If Kahu + Kahuna add up to around 100%, you have a bottleneck in the Kahuna zone. We've run into this many times on our FAS6240s.
This was valid for some (older) ONTAP release; I can't really tell which one. What release was on your 6240s when you saw this kind of saturation? I don't think that what sysstat -M tells you is accurate enough to let you understand a bottleneck as described above, even in 8.0.x (it *might* be) -- definitely not in 8.1.x, where the parallelisation of things is very different (waffinity changed *a lot* between 8.0 and 8.1).
8.1.2, then 8.1.3p2. We're at 8.1.4p1 now, and we still run into this problem. I certainly agree that it's not just sysstat -M that tells us this. We also look at a bunch of other stats -- wafltop, statit, wafl_susp, wafl scan status, etc. -- and analyze the patterns from the times we've had performance problems and Kahuna+Kahu at 99-100%.
The Kahu value covers items that CANNOT run simultaneously with Kahuna items. This means that if (Kahu) is at 60% and Kahuna is running at 39%, the Kahuna zone is effectively at 99% -- so it's bottlenecked.
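As a rough illustration of that arithmetic: the function names, the 99% threshold, and the sample layout below are mine, not from any NetApp tool -- just a sketch of the heuristic, assuming (as described above) that Kahu and Kahuna work truly cannot overlap on the release in question.

```python
# Sketch of the Kahu+Kahuna heuristic from the thread. Assumption: work
# counted under (Kahu) cannot run concurrently with Kahuna work, so their
# sum approximates the load on the serial Kahuna zone. The names and the
# 99% threshold are illustrative, not part of sysstat -M itself.

def kahuna_zone_load(kahuna_pct, kahu_pct):
    """Effective utilisation of the serial Kahuna zone."""
    return kahuna_pct + kahu_pct

def saturated_samples(samples, threshold=99.0):
    """Pick out (timestamp, load) pairs where the zone nears 100%."""
    return [(t, kahuna_zone_load(kahuna, kahu))
            for t, kahuna, kahu in samples
            if kahuna_zone_load(kahuna, kahu) >= threshold]

# Hypothetical samples pre-parsed from periodic sysstat -M captures
# (the raw sysstat -M layout differs between ONTAP releases, so the
# parsing step is omitted): (timestamp, Kahuna %, Kahu %)
samples = [
    ("12:00", 35.0, 40.0),   # 75% combined: headroom left
    ("12:05", 39.0, 60.0),   # the 99% example described above
    ("12:10", 45.0, 55.0),   # 100% combined: fully saturated
]

print(saturated_samples(samples))
# -> [('12:05', 99.0), ('12:10', 100.0)]
```

Scanning a log of such samples for times when the sum sits at 99-100%, and lining those up against complaint windows, is essentially the pattern analysis described above.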
True for 8.0 *iff* the Kahu value in sysstat -M takes into account the parallelism of parallel-Kahuna (up to 5, I think it was) -- I don't know. Not accurate for 8.1.x, not even close.
I would love to learn more about it not being accurate for 8.1.x, if you can point me to something! I'm basing my information on things that were explained to me and discussed with a NetApp performance engineer on site when we were having problems. Since a lot of these are deep details most people don't care about, it's hard to fully understand them without access to internal NetApp knowledge.
There are some bugs in DOT that can contribute to this; I'd have to go back through some of my old notes, but I can tell you they're fixed in 8.1.4p1. However, workload can contribute immensely to this. In my experience, CIFS is impacted the most, since a lot of the CIFS operations are serial.
Absolutely, CIFS is very much more "serial" than NFS. I'm lucky where I am to have a very NFS-dominant workload; CIFS is more or less residual, so we never have any issues.
Things that'll certainly impact it are snapmirror deswizzling, large snapshot updates firing off at once, etc. For the most part, NFS seems to have no issues, but CIFS latency will go through the roof, and if you're at the edge, you won't know it until you cross it and CIFS becomes unusable.
The problem with lots of snapshots being fired off isn't the taking of the snapshots per se -- that's practically gratis w.r.t. resources. It's the deletion of snapshots: everyone has a schedule and it has to roll... A really expensive operation inside ONTAP, as is any deletion of files, really. A weakness, quite simply, one can say. Usually with NFS, pre-8.1 (when parallelism got much better), the SETATTR op would always stand out as the slowest d*** thing in the whole machine, and when snapshot deletes were running... ouch. The underlying reason SETATTR is so slow, AFAIU, is that it goes through serialised parts (s-Kahuna) because it messes with the WAFL buffer cache, and keeping the integrity of that cache is so critical that serialisation is a necessity (losing control of the integrity of the WAFL buffer cache = panic and halt; it's always been that way).
I know there were some optimizations to the scanners in 8.1.4p1. We just recently upgraded to 8.1.4p1 and are also working on migrating to CDot. I haven't had time to go back through the stats to see whether there's been a noticeable improvement since the upgrade, but it's something I hope to do if I get free time.
-- Mike Garrison