We had our meeting with NetApp yesterday and went over the Professional Services findings. Some things they listed are tasks we’ve been addressing since the slovol issues began: aligning misaligned VMs, adding disks to aggregates (or, in our case, moving VMs to larger aggregates with faster disks). But one thing they confirmed, which was brought to my attention via an off-toasters email discussion hours before (I give that individual much thanks!!), was BURT 393877, “inefficient pre-fetching of metadata blocks delays WAFL Consistency Point.”
Data ONTAP's WAFL filesystem periodically commits user-modified data to the
back-end storage media (disk or otherwise) to achieve a Consistency Point (CP).
Although a Consistency Point typically takes only a few seconds, a constraint
has been designed into the software that all operations needed for a single
Consistency Point must be completed within 10 minutes. If a CP has not been
completed before a 600-second timer expires, a "WAFL hung" panic is declared,
and a core dump is produced to permit diagnosis of the excessive CP delay.
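(To make the 10-minute limit concrete, here is a rough sketch of the watchdog idea in Python. This is purely my own illustration, not actual ONTAP code; the function names are made up, and os.abort() just stands in for the real panic/core-dump path.)

```python
import os
import threading

CP_TIMEOUT_SECONDS = 600  # the 10-minute limit described above

def wafl_hung_panic():
    # In Data ONTAP this is the "WAFL hung" panic plus a core dump for diagnosis;
    # os.abort() is the closest user-space analogue in this sketch.
    print("PANIC: WAFL hung, CP not finished within %d seconds" % CP_TIMEOUT_SECONDS)
    os.abort()

def run_consistency_point(commit_fn):
    """Run one CP under a watchdog timer (illustrative only, not ONTAP internals)."""
    watchdog = threading.Timer(CP_TIMEOUT_SECONDS, wafl_hung_panic)
    watchdog.start()
    try:
        commit_fn()        # write the dirty user data and metadata to disk
    finally:
        watchdog.cancel()  # CP finished in time; disarm the watchdog
```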
During the processing for a CP, some disk blocks are newly brought into use,
as fresh data is stored in the active filesystem, whereas some blocks may be
released from use. (A block which is no longer needed in the active
filesystem may remain in use in one or more snapshots until all the snapshots
which use it are deleted.) But any changes in block usage must be reflected in
the accounting information kept in the volume metadata. To make changes in
the block accounting, Data ONTAP must read metadata blocks from disk, bringing
them into the storage controller's physical memory. Because the freeing of
blocks often occurs in a random ordering, the workload of updating the metadata
for block frees can be much higher than for updating the metadata to reflect
blocks just being brought into use.
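A toy model shows why randomly ordered frees hurt so much. Assume, hypothetically, that each on-disk metadata block holds the accounting bits for 4096 consecutive filesystem blocks (the real layout is WAFL-internal; the number here is made up for illustration):

```python
import random

BLOCKS_PER_METADATA_BLOCK = 4096  # hypothetical coverage per metadata block

def metadata_blocks_touched(freed_block_numbers):
    """Count the distinct metadata blocks a batch of frees must update."""
    return len({bno // BLOCKS_PER_METADATA_BLOCK for bno in freed_block_numbers})

sequential = list(range(10_000))                        # frees of adjacent blocks
scattered  = random.sample(range(500_000_000), 10_000)  # frees spread across a large volume

print(metadata_blocks_touched(sequential))  # 3 metadata blocks
print(metadata_blocks_touched(scattered))   # roughly 9,600; nearly one metadata read per free
```

Sequential frees concentrate their updates in a handful of metadata blocks, while scattered frees touch close to one distinct metadata block per free, and every one of those that is not already cached costs a disk read.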
For greatest processing efficiency, Data ONTAP makes an effort to pre-fetch
blocks of metadata which are likely to be needed for a given Consistency Point.
However, in some releases of Data ONTAP, the pre-fetching of metadata is done
in an inefficient way, and therefore the processing for the Consistency Point
may run slower than it should. This effect can be most pronounced for certain
workloads (especially overwrite workloads) in which many blocks may be freed
in unpredictable sequences. And the problem may be compounded if other tasks
being performed by Data ONTAP attempt intensive use of the storage controller's
memory. The competition for memory may cause metadata blocks to be evicted
before the Consistency Point is finished with them, leading to buffer thrashing
and a heavy disk-read load.
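The actual fix is on NetApp's side, but conceptually the pre-fetch amounts to something like the sketch below: work out which distinct metadata blocks the upcoming frees will touch and issue those reads as a batch, so the CP is not stalled on one synchronous read at a time. Again, this is only my illustration; read_block, cache, and the thread pool are hypothetical stand-ins, not ONTAP internals.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCKS_PER_METADATA_BLOCK = 4096  # hypothetical coverage per metadata block

def prefetch_metadata(freed_block_numbers, read_block, cache):
    """Warm the buffer cache with every metadata block the pending frees will touch."""
    needed = {bno // BLOCKS_PER_METADATA_BLOCK for bno in freed_block_numbers}
    missing = [mb for mb in needed if mb not in cache]
    # Issue the reads as one parallel batch keyed by distinct metadata block,
    # rather than one synchronous read per freed block during the CP itself.
    with ThreadPoolExecutor(max_workers=16) as pool:
        for mb, data in zip(missing, pool.map(read_block, missing)):
            # Under memory pressure these buffers can be evicted again before
            # the CP gets to them, which is the thrashing described above.
            cache[mb] = data
```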
In aggravated cases, the Consistency Point may be slowed so much that it cannot
be completed in 10 minutes, thus triggering a "WAFL hung" event.
The BURT doesn’t list any specific workarounds, as, apparently, there are many depending on your environment and what’s causing it. For us, they wanted to take each FAS3160 controller down to the boot prompt and make an environment change. They didn’t say what this change was, because it would have to be undone once a version of Data ONTAP is released that fixes the issue.
On that topic, they were almost certain that 8.1.2P3 will have the fix, and said that 8.1.3 definitely will. I only get a Data ONTAP upgrade window twice a year (April & October), so I hope 8.1.2P3 has the fix. They were unsure of the release date for 8.1.3.
Some, or most, of you may be aware of this already, but I wanted to follow up with our results in case someone else starts seeing this issue. At least you’ll have a place to start with NetApp support.
Again, thanks to everyone that shared ideas on this topic! This mailing list is an invaluable resource!
From: Jeff Mohler <speedtoys.racing@gmail.com>
Sent: Wednesday, March 06, 2013 11:26 AM
To: Scott Eno
Cc: dave.withers; toasters@teaparty.net
Subject: Re: wafl_cp_slovol_warning_1 with big latency spikes
Snowquester, funny. ;)
:s is updating 'special files', nothing fancy: the underlying WAFL maps, etc.
Something has the system with indigestion IN that process, not because of it. I used to see this as well when I was at NetApp as the field perf guy, but for the life of me I cannot recall what the root cause was at THAT time.
I'll be glad to see you guys get your response. :)
On Wed, Mar 6, 2013 at 8:15 AM, Scott Eno <s.eno@me.com> wrote:
Hi Dave,
We just went through an escalation to the higher floors of NetApp. Professional Services came on-site and gathered data. We ran a storage vmotion to re-create the :s and perfstat-ed the whole event. After further analysis, the PSE claims to have found a "bug" related to the :s and the model of controller, FAS3160. We are awaiting their report. The "snowquester" here in DC has delayed that report.
If you have a case open, have your case owner check into case 2003994303.
On Mar 6, 2013, at 10:49 AM, dave.withers <dave.withers@gnomefoo.com> wrote:
> Subscribed.
>
> We run 3240's in HA 7-mode on 8.1.2 and we have been battling issues similar
> to the OP's, going back and forth with NetApp for the last 3 months on a
> resolution. We have moved hotspots that NetApp identified from 24/7
> perfstat logs off SATA onto SAS, and have removed virtually all write-heavy
> IO applications from the SATA aggregate, but we will still see the :s
> CP type and all of a sudden experience latency spikes across all
> protocols. I think we have gone through 3 upgrades based on 'bugs' NetApp
> claimed to have found/fixed. We are definitely in a better place, but the
> latency issue is still too common to feel comfortable about. I would
> definitely like to be added to an escalation and would be happy to provide
> logs/stats/etc. that may help get this issue noticed by NetApp.
>
> --
> View this message in context: http://network-appliance-toasters.10978.n7.nabble.com/wafl-cp-slovol-warning-1-with-big-latency-spikes-tp24495p24680.html
> Sent from the Network Appliance - Toasters mailing list archive at Nabble.com.
> _______________________________________________
> Toasters mailing list
> Toasters@teaparty.net
> http://www.teaparty.net/mailman/listinfo/toasters
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
--
---
Gustatus Similis Pullus