Yesterday morning one of the heads on our 3270 experienced large NFS latency spikes, causing our VMware hosts and their VMs to log storage timeouts. The latency does not correlate with any external metric we track (CPU, network, ops, etc.).
But the logs do show CP events on the aggregate hosting the VMs:
Jan 14 05:27:56 [n04:wafl.cp.slovol:warning]: aggregate aggr2 is holding up the CP.
And the EMS log has CP events logged for the duration of the episode - what can we do to prevent these issues?
<wafl_cp_toolong_warning_1 total_ms="117825" total_dbufs="32276" clean="4312" v_ino="3" v_bm="29" a_ino="0" a_bm="3428" flush="1209"/> </LR>
<LR d="14Jan2013 05:19:38" n="irt-na04" t="1358169578" id="1335304168/148007" p="4" s="Ok" o="wafl_CP_proc" vf="" type="0" seq="633232" >
<wafl_cp_slovol_warning_1 voltype="aggregate" volowner="" volname="aggr2" volident="" nt="35" nb="22045" clean="1346852" v_ino="0" v_bm="113" a_ino="0" a_bm="4" flush="0" rgid="2"/>
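In case it helps with the correlation, something like the sketch below can pull the timestamps and CP counters out of the EMS export so they can be lined up against the VM-side latency spikes. It is only a minimal sketch: it assumes the records look like the <LR>/<wafl_cp_*_warning_1> lines above, and the field meanings are guessed from the attribute names rather than taken from any NetApp documentation.

#!/usr/bin/env python
# Minimal sketch: extract CP warning events from an EMS log export.
# Assumes records shaped like the <LR ...> / <wafl_cp_*_warning_1 .../> lines above;
# attribute meanings are inferred from their names, not from NetApp documentation.
import re
import sys

# One <LR ...> header followed (possibly on the next line) by its payload element.
LR_RE = re.compile(r'<LR\s+([^>]*?)>\s*<(wafl_cp_\w+_warning_1)\s+([^/>]*)/>', re.S)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def attrs(text):
    """Turn  key="value"  pairs into a dict."""
    return dict(ATTR_RE.findall(text))

def parse(path):
    data = open(path).read()
    for header, event, payload in LR_RE.findall(data):
        h, p = attrs(header), attrs(payload)
        yield {
            "time": h.get("d"),            # human-readable timestamp
            "epoch": h.get("t"),           # unix time
            "node": h.get("n"),
            "event": event,                # toolong vs. slovol warning
            "volume": p.get("volname"),    # aggregate/volume being blamed
            "total_ms": p.get("total_ms"), # only present on the toolong record
        }

if __name__ == "__main__":
    for rec in parse(sys.argv[1]):
        print(rec)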
NetApp support wants me to run perfstat, but the issue is not ongoing - things are idle now.
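One idea for getting around the "it only happens when nobody is looking" problem would be to trigger the collection from the event itself. The sketch below is just an illustration, not a supported tool: it tails a syslog file the filer forwards to and kicks off whatever collection command support wants run. The log path and the command are placeholders, and it does not try to reproduce perfstat's actual flags.

#!/usr/bin/env python
# Illustration only: fire a data-collection command the moment a
# wafl.cp.slovol warning shows up in a syslog file the filer forwards to.
# SYSLOG_PATH and COLLECT_CMD are placeholders for your environment;
# substitute whatever perfstat/statit invocation support has asked you to run.
import subprocess
import time

SYSLOG_PATH = "/var/log/filers.log"          # hypothetical forwarded-syslog file
TRIGGER = "wafl.cp.slovol"                   # event string from the EMS/syslog message
COLLECT_CMD = ["/usr/local/bin/collect.sh"]  # hypothetical wrapper around perfstat
COOLDOWN = 1800                              # don't re-trigger for 30 minutes

def follow(path):
    """Yield new lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)                         # start at end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

def main():
    last_fired = 0.0
    for line in follow(SYSLOG_PATH):
        if TRIGGER in line and time.time() - last_fired > COOLDOWN:
            last_fired = time.time()
            subprocess.Popen(COLLECT_CMD)    # run asynchronously; don't block the tail

if __name__ == "__main__":
    main()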
thanks
Fletcher Cocquyt Principal Engineer Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
Subscribed.
We run 3240s in HA 7-mode on 8.1.2 and have been battling issues similar to the OP's, going back and forth with NetApp for the last 3 months on a resolution. We have moved hotspots that NetApp identified from 24/7 perfstat logs off SATA and onto SAS, and have removed virtually all write-heavy IO applications from the SATA aggregate, but we still see the :s CP type and all of a sudden experience latency spikes across all protocols. I think we have gone through 3 upgrades based on 'bugs' NetApp claimed to have found/fixed. We are definitely in a better place, but the latency issue is still too common to feel comfortable about. We would definitely like to be added to an escalation and would be happy to provide logs/stats/etc. that might help get this issue noticed by NetApp.
Hi Dave,
We just went through an escalation to the higher floors of NetApp. Professional Services came on-site and gathered data. We ran a storage vMotion to re-create the :s and perfstat'ed the whole event. After further analysis, the PSE claims to have found a "bug" related to the :s and our controller model, the FAS3160. We are awaiting their report. The "snowquester" here in DC has delayed that report.
If you have a case open, have your case owner check into case 2003994303.
Snowquester, funny. ;)
:s means the CP is updating 'special files' - nothing fancy, just WAFL's underlying maps, etc.
Something is giving the system indigestion IN that process, not because of it. I used to see this as well when I was in the service of NetApp as the field perf guy, but for the life of me I cannot recall what the root cause was at that time.
I'll be glad to see you guys get your response. :)
We had our meeting with NetApp yesterday and went over the Professional Services findings. Some of the things they listed are tasks we've been addressing since the slovol issues began: aligning mis-aligned VMs, adding disks to aggregates (or, in our case, moving VMs to a larger aggregate with faster disks). But one thing they confirmed, which had been brought to my attention via an off-toasters email discussion a few hours earlier (I give that individual much thanks!!), was BURT 393877, "inefficient pre-fetching of metadata blocks delays WAFL Consistency Point."
Data ONTAP's WAFL filesystem periodically commits user-modified data to the
back-end storage media (disk or otherwise) to achieve a Consistency Point (CP).
Although a Consistency Point typically takes only a few seconds, a constraint
has been designed into the software that all operations needed for a single
Consistency Point must be completed within 10 minutes. If a CP has not been
completed before a 600-second timer expires, a "WAFL hung" panic is declared,
and a core dump is produced to permit diagnosis of the excessive CP delay.
During the processing for a CP, some disk blocks are newly brought into use,
as fresh data is stored in the active filesystem, whereas some blocks may be
released from use. (Although a block which is no longer needed in the active
filesystem may remain in use in one or more snapshots, until all the snapshots
which use it are deleted.) But any changes in block usage must be reflected in
the accounting information kept in the volume metadata. To make changes in
the block accounting, Data ONTAP must read metadata blocks from disk, bringing
them into the storage controller's physical memory. Because the freeing of
blocks often occurs in a random ordering, the workload of updating the metadata
for block frees can be much higher than for updating the metadata to reflect
blocks just being brought into use.
For greatest processing efficiency, Data ONTAP makes an effort to pre-fetch
blocks of metadata which are likely to be needed for a given Consistency Point.
However, in some releases of Data ONTAP, the pre-fetching of metadata is done
in an inefficient way, and therefore the processing for the Consistency Point
may run slower than it should. This effect can be most pronounced for certain
workloads (especially overwrite workloads) in which many blocks may be freed
in unpredictable sequences. And the problem may be compounded if other tasks
being performed by Data ONTAP attempt intensive use of the storage controller's
memory. The competition for memory may cause metadata blocks to be evicted
before the Consistency Point is finished with them, leading to buffer thrashing
and a heavy disk-read load.
In aggravated cases, the Consistency Point may be slowed so much that it cannot
be completed in 10 minutes, thus triggering a "WAFL hung" event.
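A toy sketch makes the two effects in that description concrete. It is purely an illustration with made-up numbers (the entries-per-metadata-block figure, the aggregate size, and the cache sizes are all assumptions), not anything NetApp provided and not a model of Data ONTAP's actual buffer cache. It just shows that frees arriving in random order touch a huge number of distinct metadata blocks, and that once that working set no longer fits in the memory available, an LRU cache starts thrashing and the reads hit disk again.

# Toy illustration of the two effects described above -- made-up numbers,
# not ONTAP internals: (1) frees in random order touch far more metadata
# blocks than sequential ones, and (2) once that working set exceeds the
# memory available, an LRU cache thrashes and every pass turns into disk reads.
import random
from collections import OrderedDict

ENTRIES_PER_METADATA_BLOCK = 1024      # assumed: one metadata block covers 1024 data blocks
TOTAL_DATA_BLOCKS = 1_000_000_000      # roughly a 4 TB aggregate at 4 KB blocks (assumed)
FREES = 32_000                         # about the dbuf count in the warning earlier in the thread

def metadata_blocks(block_numbers):
    """Metadata block ids touched when updating accounting for these data blocks."""
    return [b // ENTRIES_PER_METADATA_BLOCK for b in block_numbers]

def disk_reads(accesses, cache_slots):
    """Replay accesses against a small LRU cache; every miss counts as a disk read."""
    cache = OrderedDict()
    misses = 0
    for blk in accesses:
        if blk in cache:
            cache.move_to_end(blk)         # refresh LRU position
            continue
        misses += 1
        cache[blk] = True
        if len(cache) > cache_slots:
            cache.popitem(last=False)      # evict least recently used
    return misses

sequential = metadata_blocks(range(FREES))
scattered = metadata_blocks(random.sample(range(TOTAL_DATA_BLOCKS), FREES))

print("distinct metadata blocks, sequential frees:", len(set(sequential)))   # 32
print("distinct metadata blocks, scattered frees: ", len(set(scattered)))    # typically 31,000+

# Pretend each free needs its metadata block twice (read it, then update it later),
# and see what happens as the cache shrinks relative to the working set.
workload = scattered + scattered
for slots in (40_000, 10_000, 1_000):
    print(f"{slots:>6} cached metadata blocks -> {disk_reads(workload, slots)} disk reads")
# With enough cache the second pass is nearly free; with a small cache almost every
# access misses again -- the buffer thrashing and heavy disk-read load the BURT describes.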
The BURT doesn't list any specific workarounds because, apparently, there are many, depending on your environment and what is causing the slowdown. For us, they wanted to take each FAS3160 controller down to the boot prompt and make an environment change. They didn't say what the change was, because it would have to be undone once a version of Data ONTAP is released that fixes the issue.
On that topic, they were almost certain that 8.1.2P3 will have the fix, and said that 8.1.3 definitely will. I only get a Data ONTAP upgrade window twice a year (April and October), so I hope 8.1.2P3 has it. They were unsure of the release date for 8.1.3.
Some, or most, of you may be aware of this already, but I wanted to follow up with our results in case someone else starts seeing this issue. At least you'll have a place to start with NetApp support.
Again, thanks to everyone that shared ideas on this topic! This mailing list is an invaluable resource!