I know that if a Filer is not busy serving data it will run background tasks at a higher priority, using the spare CPU while the Filer is not under load.
In an active/active config the Filer CPU shouldn't be over 50% on each controller, as above this the Filer is no longer HA.
My question is: if a Filer is showing low latency and isn't busy, but its CPU is at 50% on each controller, is this an issue?
I'm not clear whether a Filer with two controllers showing 40-50% CPU due to background tasks, while not busy, would still be HA if one controller were to fail. My thinking is that those background tasks would just run at a lower level (some may not run in a failover state?).
CPU is a pretty poor measure of performance to the user workload... but it depends(tm) what you wanna do.
Do you think HA is there to provide a 100% seamless service where there is zero impact, or to provide services in the case of a failure, where there may be additional latency but you are _still working_?
Either way, consistent HA testing (yearly?) will help you track the resiliency of your HA solution, because honestly CPU is not the best way to look at this, at least by itself.
Hi Jeff,
I ask because we have a Filer that isn't experiencing any performance issues from the client perspective but is intermittently showing over 50% CPU on each controller. I'm not clear whether this is because the Filer isn't busy and is doing background tasks, or whether it is genuinely busy (perfstats just show the Kahuna domain busy).
In the past NetApp have said that if both controllers' CPU is over 50% then the Filer is not HA and an upgrade should be considered (obviously they want to sell Filers).
As you said, it's not possible to confirm whether this is an issue without failing the controller over and confirming whether there are any latency issues on the clients.
I just think it's incorrect to say that a Filer with both controllers intermittently over 50% CPU is not HA and requires an upgrade.
In my environment HA is there to provide a 100% seamless service with zero impact, but it's difficult to justify testing this on a production system.
Martin
We have a similar issue but with different numbers. We will have one controller showing 100% CPU busy, but when running sysstat -M all of the processors show fairly low usage; the Kahuna domain is at 80-90+ percent. Mostly the users don't seem to be affected, but administrative tasks definitely are. For instance, the status displays in OnCommand will have gaps when reporting latency, I/O, throughput, etc. In my searching I haven't found a clear answer to what is happening (other than background tasks) and have found no information on how to determine what is happening or how to limit whatever tasks are causing so much CPU busy. Granted, the number is not an obvious problem since no individual CPU shows high usage, but it's definitely affecting what is happening with the system.
Kelley R. Green
IT Specialist, Global Technology Services - Storage
Cell 801-916-1273, e-mail: krgreen@us.ibm.com
I wish I could give a more complete answer, but... I can't.
All I can say is that ONTAP CPU measurement is not as clear cut as you'd think it is or could be.
About the last time 'the number' meant something accurate was when there was 1 CPU and 1 core.
That's all I'm able to say.
Jeff Mohler wrote:
All I can say is that ontap CPU measurement is not as clear cut as you'd think it is or could be.
Indeed. :-) The definition is, still to this day AFAIK:
"For systems running Data ONTAP versions 7.2.1 or later, the cpu_busy counter is the greater of either average CPU utilization or the busiest domain."
So the obscurity lies in this bit: the busiest domain. What's the exact definition of that? Serial Kahuna (the column in the output of sysstat -M named Kahuna) is very important and has been for a very long time, so it comes into play. But it's partly parallelised in a complicated way, and that's what makes cpu_busy so "wrong", or... convoluted, or...
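To make the quoted definition concrete, here's a minimal sketch in Python. The domain names and figures are hypothetical; on a real 7-mode system the inputs would come from system:system:avg_processor_busy and the per-domain columns of sysstat -M.

# Minimal sketch of the cpu_busy definition quoted above (7.2.1+):
# cpu_busy = max(average CPU utilisation, utilisation of the busiest domain).
# Domain names and numbers below are made up for illustration.
def cpu_busy(avg_processor_busy, domain_busy):
    """Return the greater of average CPU utilisation and the busiest domain."""
    return max(avg_processor_busy, max(domain_busy.values()))

# Average utilisation is modest, but serial Kahuna dominates, so cpu_busy
# reports the Kahuna figure: the top-line number looks alarming while
# most cores sit idle.
print(cpu_busy(35.0, {"Kahuna": 85.0, "Network": 20.0, "Storage": 15.0}))  # 85.0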
About the last time 'the number' meant something accurate, was when there was 1 cpu, and 1 core.
Mmm, that depends on what you mean by "accurate" :-) As someone at NetApp wrote to me on this very subject, quoting ex-US president B. Clinton: "it depends on what the definition of 'is', is" ;-)
If someone would like to discuss in detail what the formula used to calculate the value in system:system:cpu_busy actually is, I can take that discussion in a separate e-mail conversation. I know it pretty accurately for 8.0.x; I'm not entirely sure what it is for 8.1.1 (waffinity was completely re-done there, making things far more complex than ever before).
/M
Kelley Green wrote:
We have a similar issue but different numbers. We will have one controller that is showing 100% cpu busy but when doing the sysstat -M all of the processors are showing fairly low usage. The Kahuna task is in the 80-90 plus percent.
If you *really* have Kahuna in sysstat -M showing 80-90%, you're definitely saturated and should already have latency issues. But... as always, YMMV. Which version of ONTAP is this?
In general, over 50% [serial] Kahuna utilisation is a warning sign. It depends on what protocol(s) the machine is serving, what the workload is, and how latency sensitive the application generating it is. There are workloads that drive s-Kahuna very hard and very little of anything else, and vice versa.
And please note:
cpu_busy and avg_processor_busy are completely different things. I.e. these two PCs (perf counters) in the CM show very different metrics:
system:system:cpu_busy
system:system:avg_processor_busy
(This is if we stick to 7-mode here and leave cDOT out of it.)
In a late ONTAP, say 8.1.1 with 4-8 cores, you can just ignore cpu_busy for any headroom judgement. The formula for how it's calculated is fairly complicated, but also rather meaningless these days. It used to be more meaningful when the kernel was much less threaded. The difference between 8.0.x and 8.1.x is pretty big in this respect, and even more stuff was broken out of serial Kahuna in 8.2. Those of you still on 8.0.x (or even earlier) might be more interested in cpu_busy.
That said, if you do have an 8.1.1 or 8.2 machine with only 4 cores, then it will also have too little memory to really deal with the workload very well; things can get bad due to WAFL not getting enough buffers to do its job. One can say that 8.1.1 and 8.2 are really for newer FAS boxes with 8 or more cores and a lot of RAM. Only then will it do a good job for you, and then it *really* does.
I usually now look at cpu_busy being <99% as just a sign ("proof") that there are CPU resources (the union of kernel domain capacity and pure cycles) which are not being used for anything at all. But it can say 99% pretty much forever and you still have lots of headroom for absorbing more workload pressure without affecting latency in any way. In 8.1.x and above, serial Kahuna, as seen in sysstat -M, doesn't do that much anymore, and it's very rare that it comes even close to 50% on a 6200 machine. It *can* happen if you have a type of workload that causes things to be done which are still in s-Kahuna; I've never seen such a workload close up myself. (Things are different in smaller Filers; the smaller 3100 and 3200 boxes are not as resilient to s-Kahuna being overloaded.)
In my searching I haven't found a clear answer of what is happening (other than background tasks) and have gotten no information about how to try to determine what is happening or how to limit whatever tasks are causing so much CPU busy. Granted the number is not an obvious problem since each CPU is not high usage but it's definitely affecting what is happening with the system.
This is probably a red herring. cpu_busy being 99% all the time should not affect things in the way you're describing here, as far as I can tell. If you're on a big machine, then avg_processor_busy being low means all is hunky dory. I'd say that the cause of your issues as per above is this, if anything:
The Kahuna task is in the 80-90 plus percent.
If this is so, then it's definitely *bad* for lots of things.
/M
At one point, there was a NetApp KB article about how DOT takes advantage of the different CPUs and where the bottlenecks are. It appears they no longer have that article as public - https://kb.netapp.com/support/index?page=content&id=3010150&locale=e...
One of the big things we learned from this is that when looking at Kahuna, you also need to look at the (Kahu) value listed under WAFL_EX. If Kahu+Kahuna adds up to around 100%, that is when you have a bottleneck in the Kahuna zone. We've run into this many times on our FAS6240s.
The Kahu value covers items that CANNOT run simultaneously with Kahuna items. This means that if (Kahu) is 60% and Kahuna is running at 39%, the Kahuna zone is actually at 99% - so it's bottlenecked (a sketch of the arithmetic follows).
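To put numbers on it, a minimal sketch; the 95% alarm threshold is my own illustrative choice, not a NetApp figure.

# Kahu (under WAFL_EX) and Kahuna, both from sysstat -M, cannot run
# simultaneously, so their utilisations add up against the same budget.
def kahuna_zone_busy(kahuna, kahu):
    """Combined load on the serial Kahuna zone, in percent."""
    return kahuna + kahu

def is_bottlenecked(kahuna, kahu, threshold=95.0):
    return kahuna_zone_busy(kahuna, kahu) >= threshold

# The example above: Kahuna at 39% plus (Kahu) at 60% = 99%, i.e. the
# Kahuna zone is effectively saturated even though neither number alone
# looks alarming.
print(is_bottlenecked(39.0, 60.0))  # True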
There are some bugs in DOT that can contribute to this; I'd have to go back through some of my old information, but I can tell you they're fixed in 8.1.4p1. However, workload can contribute immensely to this. In my experience, CIFS is impacted the most, since a lot of the CIFS operations are serial. Windows Offline Folder caching appears to put a lot of serial workload on the controller. System operations that will certainly impact it are things like snapmirror deswizzling, large snapshot updates firing off at once, etc. For the most part, NFS seems to have no issues, but CIFS latency will go through the roof, and if you're at the edge, you won't know it until you cross it and CIFS becomes unusable.
-- Mike Garrison
Michael Garrison wrote:
At one point, there was a NetApp KB article talking about how DOT takes advantage of the different CPUs and talking about bottlenecks. It appears they no longer have that article as public -
Because the information in it is no longer valid and would lead people in the wrong directions, supposedly. I can understand this; keeping such a document up to date with everything that's happened between 8.0.x and 8.1.x and then 8.2 w.r.t. parallelisation of things in the kernel would be a daunting task at best.
One of the big things we learned from this is that when looking at Kahuna, you also need to take a look at the (Kahu) value listed in WAFL_EX. If Kahu+Kahuna add up to around 100%, that is when you have a bottleneck in the Kahuna zone. We've run into this many times on our FAS6240s.
This was valid for some (older) ONTAP release, I can't really tell which one. What release was on your 6240s when you saw this kind of saturation? I don't think what sysstat -M tells you is accurate in the sense that it will enable you to understand a bottleneck as described above, even in 8.0.x (it *might* be) -- definitely not in 8.1.x (big difference in the parallelisation of things; waffinity changed *a lot* between 8.0 and 8.1).
The Kahu value are items that CAN NOT run simultaneous with Kahuna items. This means that if (Kahu) is 60% and Kahuna is running at 39%, The Kahuna zone is actually at 99% - so it's bottlenecked.
True for 8.0 *iff* the Kahu value in sysstat -M takes into account the parallelism (up to 5, I think it was) of parallel-Kahuna. I don't know. Not accurate for 8.1.x, not even close.
There are some bugs in DOT that can contribute to this, I'd have to go back through some of my old information but I can tell you they're fixed in 8.1.4p1. However, workload can contribute immensely to this. In my experience, CIFS is impacted the most by this, since a lot of the CIFS operations are serial.
Absolutely, CIFS is very much more "serial" than NFS. I'm lucky where I am to have a very NFS-dominant workload; CIFS is more or less residual, so we never have any issues.
that'll certainly impact it are things like snapmirror deswizzling, large snapshot updates firing off at once, etc. For the most part, NFS seems to have no issues, but CIFS latency will go through the roof and if you're at the edge, you won't know it until you cross it and CIFS becomes unusable.
The problem with lots of snapshots being fired off isn't the taking of the snapshots per se, as that's literally gratis w.r.t. resources. It's the deletion of snapshots; everyone has a schedule and it has to roll... A really expensive operation inside ONTAP, as is any deletion of files, really. A weakness, quite simply, one can say. Usually with NFS, and in pre-8.1 (when parallelism got much better), the SETATTR op would always stand out as the slowest d*** thing in the whole machine, and when snapshot deletes were running... ouch. The underlying reason for SETATTR being so slow is, AFAIU, that it goes through serialised parts (s-Kahuna) due to messing with the WAFL buffer cache, and keeping the integrity of that is so critical that serialisation is a necessity (losing control of the integrity of the WAFL buffer cache = panic and halt; it's always been that way).
/M
Here's a solution.
Test your HA environments. Record the results. Do it again in 6/12 months and record the results.
Plan accordingly. All y'all are gonna have _different results_.
Or, trust that HAVING an HA environment means it works the way people are assuming it works, in your current data vacuum.
Telcos do it, why doesn't IT?
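For what it's worth, a minimal sketch of the record-and-compare idea; the log file and field names here are hypothetical choices, not part of any NetApp tooling.

# Append each failover test result so runs 6/12 months apart can be compared.
import json
from datetime import date

LOG = "ha_failover_tests.jsonl"  # hypothetical log file, one JSON object per line

def record_test(cluster, takeover_seconds, peak_client_latency_ms):
    """Record one HA failover test result."""
    entry = {
        "date": date.today().isoformat(),
        "cluster": cluster,
        "takeover_seconds": takeover_seconds,
        "peak_client_latency_ms": peak_client_latency_ms,
    }
    with open(LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

def history(cluster):
    """Past results for one cluster, oldest first, to spot drift over time."""
    with open(LOG) as f:
        entries = [json.loads(line) for line in f]
    return [e for e in entries if e["cluster"] == cluster]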
On Tue, Apr 8, 2014 at 1:04 PM, Michael Bergman michael.bergman@ericsson.com wrote:
Michael Garrison wrote:
At one point, there was a NetApp KB article talking about how DOT takes advantage of the different CPUs and talking about bottlenecks. It appears they no longer have that article as public -
Because the information in it is no longer valid, and would lead people in the wrong directions, supposedly. [...]
Ah, okay, that's interesting. When I learned about this we were on 8.1.x (still are), so I was under the impression it applied to 8.1.x. It not applying to 8.1.x is new knowledge for me.
This was valid for some (older) ONTAP release, I can't really tell which one. What release was on your 6240s when you saw this kind of saturation? [...]
8.1.2, then 8.1.3p2. We're at 8.1.4p1 now and we still run into this problem. I certainly agree that it's not just sysstat -M that tells us this; we look at a bunch of other stats, like wafltop, statit, wafl_susp, wafl scan status, etc., and analyze patterns from when we've had performance problems and had Kahuna+Kahu at 99-100%.
True for 8.0 *iff* the Kahu value in sysstat -M takes into account the parallelism (up to 5, I think it was) of parallel-Kahuna. I don't know. Not accurate for 8.1.x, not even close.
I would love to learn more about it not being accurate for 8.1.x, if you can point me to it! I'm basing my information on things that were explained to me and discussed with a NetApp performance engineer on site when we were having problems. Since a lot of these details are deep details most people don't care about, it's hard to fully understand them if you don't have access to internal NetApp knowledge.
The problem with lots of snapshots being fired off isn't the taking of the snapshots per se... It's the deletion of snapshots; everyone has a schedule and it has to roll... A really expensive operation inside ONTAP, as is any deletion of files, really. [...]
I know there were some optimizations to the scanners that are in 8.1.4p1. We just recently upgraded to 8.1.4p1 and are also working on migrating to cDOT. I haven't had time to go back and look at the stats to see whether there's been a noticeable improvement after the upgrade, but that's something I hope to do if I get free time.
-- Mike Garrison