Yesterday morning one of the heads on our 3270 experienced large NFS latency spikes, causing our VMware hosts and their VMs to log storage timeouts. The latency does not correlate with any external metric we track (CPU, network, ops, etc.).
But the logs do show CP events on the aggregate hosting the VMs:
Jan 14 05:27:56 [n04:wafl.cp.slovol:warning]: aggregate aggr2 is holding up the CP.
And the EMS log has CP events logged for the duration of the episode - what can we do to prevent these issues?
<wafl_cp_toolong_warning_1 total_ms="117825" total_dbufs="32276" clean="4312" v_ino="3" v_bm="29" a_ino="0" a_bm="3428" flush="1209"/> </LR>
<LR d="14Jan2013 05:19:38" n="irt-na04" t="1358169578" id="1335304168/148007" p="4" s="Ok" o="wafl_CP_proc" vf="" type="0" seq="633232" >
<wafl_cp_slovol_warning_1 voltype="aggregate" volowner="" volname="aggr2" volident="" nt="35" nb="22045" clean="1346852" v_ino="0" v_bm="113" a_ino="0" a_bm="4" flush="0" rgid="2"/>
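In case it helps with the correlation, something like the sketch below can pull the timestamps and CP counters out of the EMS export so they can be lined up against the VM-side latency spikes. It is only a minimal sketch: it assumes the records look like the <LR>/<wafl_cp_*_warning_1> lines above, and the field meanings are guessed from the attribute names rather than taken from any NetApp documentation.

#!/usr/bin/env python
# Minimal sketch: extract CP warning events from an EMS log export.
# Assumes records shaped like the <LR ...> / <wafl_cp_*_warning_1 .../> lines above;
# attribute meanings are inferred from their names, not from NetApp documentation.
import re
import sys

# One <LR ...> header followed (possibly on the next line) by its payload element.
LR_RE = re.compile(r'<LR\s+([^>]*?)>\s*<(wafl_cp_\w+_warning_1)\s+([^/>]*)/>', re.S)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def attrs(text):
    """Turn  key="value"  pairs into a dict."""
    return dict(ATTR_RE.findall(text))

def parse(path):
    data = open(path).read()
    for header, event, payload in LR_RE.findall(data):
        h, p = attrs(header), attrs(payload)
        yield {
            "time": h.get("d"),            # human-readable timestamp
            "epoch": h.get("t"),           # unix time
            "node": h.get("n"),
            "event": event,                # toolong vs. slovol warning
            "volume": p.get("volname"),    # aggregate/volume being blamed
            "total_ms": p.get("total_ms"), # only present on the toolong record
        }

if __name__ == "__main__":
    for rec in parse(sys.argv[1]):
        print(rec)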
NetApp support wants me to run perfstat, but the issue is not ongoing - things are idle now.
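One idea for getting around the "it only happens when nobody is looking" problem would be to trigger the collection from the event itself. The sketch below is just an illustration, not a supported tool: it tails a syslog file the filer forwards to and kicks off whatever collection command support wants run. The log path and the command are placeholders, and it does not try to reproduce perfstat's actual flags.

#!/usr/bin/env python
# Illustration only: fire a data-collection command the moment a
# wafl.cp.slovol warning shows up in a syslog file the filer forwards to.
# SYSLOG_PATH and COLLECT_CMD are placeholders for your environment;
# substitute whatever perfstat/statit invocation support has asked you to run.
import subprocess
import time

SYSLOG_PATH = "/var/log/filers.log"          # hypothetical forwarded-syslog file
TRIGGER = "wafl.cp.slovol"                   # event string from the EMS/syslog message
COLLECT_CMD = ["/usr/local/bin/collect.sh"]  # hypothetical wrapper around perfstat
COOLDOWN = 1800                              # don't re-trigger for 30 minutes

def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)                         # start at end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

def main():
    last_fired = 0.0
    for line in follow(SYSLOG_PATH):
        if TRIGGER in line and time.time() - last_fired > COOLDOWN:
            last_fired = time.time()
            subprocess.Popen(COLLECT_CMD)    # run asynchronously; don't block the tail

if __name__ == "__main__":
    main()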
thanks
Fletcher Cocquyt Principal Engineer Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
Subscribed.
We run 3240s in HA 7-mode on 8.1.2 and have been battling issues similar to the OP's, going back and forth with NetApp for the last 3 months on a resolution. We have moved hotspots that NetApp identified from 24/7 perfstat logs off SATA and onto SAS, and have removed virtually all write-heavy IO applications from the SATA aggregate, but we still see the :s CP type and all of a sudden experience latency spikes across all protocols. I think we have gone through 3 upgrades based on 'bugs' NetApp claimed to have found/fixed. We are definitely in a better place, but the latency issue is still too common to feel comfortable about. We would definitely like to be added to an escalation and would be happy to provide logs/stats/etc. that might help get this issue noticed by NetApp.
Hi Dave,
We just went through an escalation to the higher floors of NetApp. Professional Services came on-site and gathered data. We ran a storage vMotion to re-create the :s and perfstat'ed the whole event. After further analysis, the PSE claims to have found a "bug" related to the :s and our controller model, the FAS3160. We are awaiting their report. The "snowquester" here in DC has delayed that report.
If you have a case open, have your case owner check into case 2003994303.
Snowquester, funny. ;)
:s means the CP is updating 'special files' - nothing fancy, just WAFL's underlying maps, etc.
Something is giving the system indigestion IN that process, not because of it. I used to see this as well when I was in the service of NetApp as the field perf guy, but for the life of me I cannot recall what the root cause was at that time.
I'll be glad to see you guys get your response. :)
We had our meeting with NetApp yesterday and went over the Professional Services findings. Some of the things they listed are tasks we've been addressing since the slovol issues began: aligning mis-aligned VMs, adding disks to aggregates (or, in our case, moving VMs to a larger aggregate with faster disks). But one thing they confirmed, which had been brought to my attention via an off-toasters email discussion a few hours earlier (I give that individual much thanks!!), was BURT 393877, "inefficient pre-fetching of metadata blocks delays WAFL Consistency Point."
Data ONTAP's WAFL filesystem periodically commits user-modified data to the
back-end storage media (disk or otherwise) to achieve a Consistency Point (CP).
Although a Consistency Point typically takes only a few seconds, a constraint
has been designed into the software that all operations needed for a single
Consistency Point must be completed within 10 minutes. If a CP has not been
completed before a 600-second timer expires, a "WAFL hung" panic is declared,
and a core dump is produced to permit diagnosis of the excessive CP delay.
During the processing for a CP, some disk blocks are newly brought into use,
as fresh data is stored in the active filesystem, whereas some blocks may be
released from use. (Although a block which is no longer needed in the active
filesystem may remain in use in one or more snapshots, until all the snapshots
which use it are deleted.) But any changes in block usage must be reflected in
the accounting information kept in the volume metadata. To make changes in
the block accounting, Data ONTAP must read metadata blocks from disk, bringing
them into the storage controller's physical memory. Because the freeing of
blocks often occurs in a random ordering, the workload of updating the metadata
for block frees can be much higher than for updating the metadata to reflect
blocks just being brought into use.
For greatest processing efficiency, Data ONTAP makes an effort to pre-fetch
blocks of metadata which are likely to be needed for a given Consistency Point.
However, in some releases of Data ONTAP, the pre-fetching of metadata is done
in an inefficient way, and therefore the processing for the Consistency Point
may run slower than it should. This effect can be most pronounced for certain
workloads (especially overwrite workloads) in which many blocks may be freed
in unpredictable sequences. And the problem may be compounded if other tasks
being performed by Data ONTAP attempt intensive use of the storage controller's
memory. The competition for memory may cause metadata blocks to be evicted
before the Consistency Point is finished with them, leading to buffer thrashing
and a heavy disk-read load.
In aggravated cases, the Consistency Point may be slowed so much that it cannot
be completed in 10 minutes, thus triggering a "WAFL hung" event.
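A toy sketch makes the two effects in that description concrete. It is purely an illustration with made-up numbers (the entries-per-metadata-block figure, the aggregate size, and the cache sizes are all assumptions), not anything NetApp provided and not a model of Data ONTAP's actual buffer cache. It just shows that frees arriving in random order touch a huge number of distinct metadata blocks, and that once that working set no longer fits in the memory available, an LRU cache starts thrashing and the reads hit disk again.

# Toy illustration of the two effects described above -- made-up numbers,
# not ONTAP internals: (1) frees in random order touch far more metadata
# blocks than sequential ones, and (2) once that working set exceeds the
# memory available, an LRU cache thrashes and every pass turns into disk reads.
import random
from collections import OrderedDict

ENTRIES_PER_METADATA_BLOCK = 1024      # assumed: one metadata block covers 1024 data blocks
TOTAL_DATA_BLOCKS = 1_000_000_000      # roughly a 4 TB aggregate at 4 KB blocks (assumed)
FREES = 32_000                         # about the dbuf count in the warning earlier in the thread

def metadata_blocks(block_numbers):
    """Metadata block ids touched when updating accounting for these data blocks."""
    return [b // ENTRIES_PER_METADATA_BLOCK for b in block_numbers]

def disk_reads(accesses, cache_slots):
    """Replay accesses against a small LRU cache; every miss counts as a disk read."""
    cache = OrderedDict()
    misses = 0
    for blk in accesses:
        if blk in cache:
            cache.move_to_end(blk)         # refresh LRU position
            continue
        misses += 1
        cache[blk] = True
        if len(cache) > cache_slots:
            cache.popitem(last=False)      # evict least recently used
    return misses

sequential = metadata_blocks(range(FREES))
scattered = metadata_blocks(random.sample(range(TOTAL_DATA_BLOCKS), FREES))

print("distinct metadata blocks, sequential frees:", len(set(sequential)))   # 32
print("distinct metadata blocks, scattered frees: ", len(set(scattered)))    # typically 31,000+

# Pretend each free needs its metadata block twice (read it, then update it later),
# and see what happens as the cache shrinks relative to the working set.
workload = scattered + scattered
for slots in (40_000, 10_000, 1_000):
    print(f"{slots:>6} cached metadata blocks -> {disk_reads(workload, slots)} disk reads")
# With enough cache the second pass is nearly free; with a small cache almost every
# access misses again -- the buffer thrashing and heavy disk-read load the BURT describes.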
The BURT doesn't list any specific workarounds because, apparently, there are many, depending on your environment and what is causing the slowdown. For us, they wanted to take each FAS3160 controller down to the boot prompt and make an environment change. They didn't say what the change was, because it would have to be undone once a version of Data ONTAP is released that fixes the issue.
On that topic, they were almost certain that 8.1.2P3 will have the fix, and said that 8.1.3 definitely will. I only get a Data ONTAP upgrade window twice a year (April and October), so I hope 8.1.2P3 has it. They were unsure of the release date for 8.1.3.
Some, or most, of you may be aware of this already, but I wanted to follow up with our results in case someone else starts seeing this issue. At least you'll have a place to start with NetApp support.
Again, thanks to everyone that shared ideas on this topic! This mailing list is an invaluable resource!