Is anyone familiar with messages like this one, and with how to diagnose which volume(s) on which aggregate are triggering it?
4/27/2018 11:24:31 node1 NOTICE wafl.zombie.susp.msg.threshold: WAFL(R) is experiencing zombie throttling possibly due to requests for large number of file deletions. This can be mitigated by a combination of a) reducing the load on the system, b) issuing the file deletion requests in smaller batches and c) increasing the limits to allow more zombies to be queued on the system (Please contact technical support).
I'm troubleshooting a latency/performance issue and this caught my eye in the event logs. Web searches aren't really coming up with anything useful and there is mention of it in the NetApp KB but no real background information or troubleshooting steps.
Also, the (R)egistered Trademark after WAFL in the notice is weird to have in a technical event log.
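For lining these notices up against latency graphs, a small parser over exported event-log lines may help. This is only a sketch: the line layout (date, time, node, severity, event name, message) is assumed from the single example above, and the helper names `parse_event_line` and `zombie_events_per_hour` are made up for illustration. It does not identify the volume or aggregate; it only buckets the notices by node and hour.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, inferred from the one example line in this thread:
# "<M/D/YYYY> <H:MM:SS> <node> <SEVERITY> <event.name>: <message>"
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{4}) (?P<time>\d{1,2}:\d{2}:\d{2}) "
    r"(?P<node>\S+) (?P<severity>[A-Z]+) (?P<event>[\w.]+): (?P<message>.*)$"
)

ZOMBIE_EVENT = "wafl.zombie.susp.msg.threshold"


def parse_event_line(line):
    """Return a dict for a line in the assumed layout, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(
        match["date"] + " " + match["time"], "%m/%d/%Y %H:%M:%S"
    )
    return {
        "when": when,
        "node": match["node"],
        "severity": match["severity"],
        "event": match["event"],
        "message": match["message"],
    }


def zombie_events_per_hour(lines):
    """Count zombie-throttling notices per (node, hour) bucket."""
    counts = Counter()
    for line in lines:
        event = parse_event_line(line)
        if event and event["event"] == ZOMBIE_EVENT:
            hour = event["when"].replace(minute=0, second=0)
            counts[(event["node"], hour)] += 1
    return counts
```

Feeding it the lines of a saved event-log export gives a per-node, per-hour count that can be compared by eye with the latency spikes.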
Ian Ehrenwald Senior Infrastructure Engineer Hachette Book Group, Inc. 1.617.263.1948 / ian.ehrenwald@hbgusa.com
NetApp is rather silent on how... challenged it is when deleting large numbers of files, large numbers of total blocks, or both, depending on what version you are on.
And depending on what version you are on, you have multiple ways to manage it, or none.
This would be a good support call, to understand what you can and cannot do.
What you are probably seeing is something like this: https://www.flickr.com/photos/28804666@N08/shares/t9s941
A funner example here: https://www.flickr.com/photos/28804666@N08/shares/x32YM1
A bump in read -and- write latency, which is quite odd, as you don't see much more throughput than you did before; maybe the client(s) did a lookup storm to go find things to delete as well. In these examples, yes, throughput for the cluster went up, but it's actually capable of ~4GB/sec, so I know in my environment 1.4 is scratch.
But what happened under the covers in our release (9.1xx) is that the background delete workload clogs up the CP (consistency point) process and chokes the whole box, and you see B2B (back-to-back) CPs as a result. There are some dials and bootargs to remediate this, and since applying them I can wipe out 16-20TB at once with no impact.
With those dials and bootargs applied to our release on a SATA HA pair, it now looks like this. We delete huge amounts of HBase data every night, so it's tight.
https://www.flickr.com/photos/28804666@N08/shares/E0fz56
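On the client side, option (b) from the notice, issuing the deletions in smaller batches, could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the batch size of 1000 and the 5-second pause are arbitrary starting points and not NetApp recommendations, and nothing in it touches the filer-side limits, which remain a support-case matter.

```python
import os
import time
from itertools import islice


def iter_files(root):
    """Yield the path of every file under root (root is a hypothetical scratch tree)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def delete_in_batches(paths, batch_size=1000, pause_seconds=5.0, sleep=time.sleep):
    """Remove files in groups of batch_size, pausing between groups.

    Returns the number of files actually removed. The pause gives the
    storage system time to work through queued deletes before more arrive.
    """
    iterator = iter(paths)
    removed = 0
    first_batch = True
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        if not first_batch:
            sleep(pause_seconds)  # throttle between batches, not before the first
        first_batch = False
        for path in batch:
            try:
                os.remove(path)
                removed += 1
            except FileNotFoundError:
                pass  # already gone, e.g. removed by another job
    return removed
```

Something like `delete_in_batches(iter_files("/mnt/scratch/old"))` would then trickle the deletes out instead of sending them all at once; the path is made up, and the batch size and pause would need tuning against whatever the latency graphs show.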
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
What version of ONTAP (cmode) are you on, and was it a recent update?