Hi Mike, this is a SnapMirror SOURCE.
The perfstats captured during the issue include wafl scan status output - I'll mention it to the engineer when I show the disparity between AGGR OPS and total volume OPS.
Unfortunately it's not easy to differentiate the source of IOPS with any existing tools, in my experience - including perfstats and escalated NetApp support analysis.
But the NMC AGGR vs volume stats need to be explained somehow.
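In case it helps anyone reproduce the comparison outside of NMC, here's a rough sketch of what I'd run from an admin host to sample both sides around the 5am window. It assumes passwordless SSH to the controller, and the names here are placeholders/assumptions to verify: "filer01", "aggr_sata", the three volume names, and the 7-mode counter names total_transfers / total_ops.

#!/usr/bin/env python
# Rough sketch: sample the aggregate counter and the per-volume op counters
# over SSH and print them side by side, so the AGGR-vs-volume disparity can
# be seen outside of NMC. Object and counter names are assumptions - adjust
# FILER, AGGR and VOLUMES to match the real system.
import subprocess
import time

FILER = "filer01"                                   # hypothetical controller name
AGGR = "aggr_sata"                                  # the SATA aggregate under suspicion
VOLUMES = ["vol_sata1", "vol_sata2", "vol_sata3"]   # its 3 volumes (placeholders)
INTERVAL = 60                                       # seconds between samples

def ontap(cmd):
    """Run a single ONTAP CLI command over SSH and return its output."""
    out = subprocess.check_output(["ssh", FILER, cmd])
    return out.decode("utf-8", "replace").strip()

def counter(obj, instance, name):
    """Read one counter, e.g. stats show volume:vol_sata1:total_ops."""
    line = ontap("stats show %s:%s:%s" % (obj, instance, name))
    # Expected form is roughly "volume:vol_sata1:total_ops:1234/s" - keep the last field
    return line.split(":")[-1]

while True:
    stamp = time.strftime("%H:%M:%S")
    aggr_xfers = counter("aggregate", AGGR, "total_transfers")
    vol_ops = [(v, counter("volume", v, "total_ops")) for v in VOLUMES]
    print(stamp, "aggr", AGGR, aggr_xfers,
          " ".join("%s=%s" % (v, o) for v, o in vol_ops))
    time.sleep(INTERVAL)

If I remember right, stats show reports these as rates (the /s suffix), so the output should show directly whether the aggregate number climbs while the three volume numbers drop during the 5am window.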
thanks
On Jan 23, 2013, at 9:35 PM, Mike Nye Mike.Nye@xpanse.com.au wrote:
Hi Fletcher,
Is this FAS a SnapMirror destination? This sounds very much like a volume deswizzle / CBR scan issue…
You could try scheduling a “priv set advanced; wafl scan status” to run via RSH/SSH during the time of poor performance, to see what internal WAFL scans are taking place on the system.
Things to look for are “Container Block Reclamation” and/or “Volume Deswizzle”.
Kind regards, Mike Nye
Mike Nye | Team Lead Systems Engineer | mike.nye@xpanse.com.au | www.xpanse.com.au
Mob: +61 407 772 465 | Office: +61 8 9322 6767 | Fax: +61 8 9322 6077
18 Emerald Tce, West Perth WA 6005
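To automate what Mike suggests above, a minimal sketch run from cron on an admin host (assumes passwordless SSH and a hypothetical controller name "filer01"; the output just gets appended to a local file covering the 5am window):

#!/usr/bin/env python
# Minimal sketch: capture "priv set advanced; wafl scan status" over SSH
# every few minutes around 05:00 and append it to a local log file.
# Run it from cron on an admin host, e.g.
#   55 4 * * * /usr/local/bin/capture_wafl_scans.py
# "filer01" and the paths are placeholders.
import subprocess
import time

FILER = "filer01"
LOGFILE = "/var/tmp/%s_wafl_scan_status.log" % FILER
SAMPLES = 12          # 12 samples ...
INTERVAL = 300        # ... 5 minutes apart covers roughly 04:55-05:55

for _ in range(SAMPLES):
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        out = subprocess.check_output(
            ["ssh", FILER, "priv set advanced; wafl scan status"]
        ).decode("utf-8", "replace")
    except subprocess.CalledProcessError as err:
        out = "command failed: %s\n" % err
    with open(LOGFILE, "a") as fh:
        fh.write("===== %s =====\n%s\n" % (stamp, out))
    time.sleep(INTERVAL)

Then grep the captured output for the "Container Block Reclamation" and "Volume Deswizzle" scanners Mike mentions.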
From: toasters-bounces@teaparty.net On Behalf Of Fletcher Cocquyt
Sent: Thursday, 24 January 2013 1:28 PM
To: toasters@teaparty.net Lists
Subject: Aggregate Disk Busy 100% with volume IOPS low
3270 cluster, Data ONTAP 8.1 7-Mode
We are investigating a SATA aggregate showing repeated 5am disk 100% busy spikes without its volumes showing any corresponding IOPS spike as reported by NetApp Management Console (NMC). The 5am disk busy spikes correlate with very high latency on volumes on a different SAS aggregate; these volumes host VMs which then time out, some needing reboots. Today, after reviewing my perfstat, the NetApp support engineer reported this is expected since NVRAM buffers are shared between aggregates.
But when I dig further into the NMC stats I see the SATA aggregate disk busy actually corresponds to a DROP in IOPS on the 3 volumes hosted on the SATA aggregate - almost as if some internal aggregate operations are starving out the external volume ops. I checked the snapshots (vol and aggr), SnapMirror, and dedup - none of the usual suspects were running.
When I look at the NMC throughput graphs and switch on the legend - it shows a 5am READ blocks/sec spike corresponding perfectly to the disk busy.
Where are these AGGR-level READ operations coming from that are missing from the constituent volume IOPS, and in fact seem to be starving out volume-level IO?
I don't see much in the messages log, but will check the rest of the logs for internal-type ops.
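For that log check, a rough sketch of pulling just the 5am window out of a copy of /etc/messages (assumes the file has been copied locally and uses the usual "Day Mon DD HH:MM:SS" syslog-style prefix):

#!/usr/bin/env python
# Rough sketch: print only the lines of a locally copied /etc/messages
# whose timestamp falls in the 04:45-05:30 window, to narrow down what
# the filer logged around the disk-busy spikes. The default path and the
# timestamp layout are assumptions.
import re
import sys

WINDOW_START = (4, 45)
WINDOW_END = (5, 30)
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):\d{2}\b")   # match HH:MM:SS

path = sys.argv[1] if len(sys.argv) > 1 else "messages"
with open(path) as fh:
    for line in fh:
        m = TIME_RE.search(line)
        if not m:
            continue
        hhmm = (int(m.group(1)), int(m.group(2)))
        if WINDOW_START <= hhmm <= WINDOW_END:
            sys.stdout.write(line)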
thanks for any insight
Remember that an aggr op is physical, and a volume op is logical... and may not even move a R/W head to do it.
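One way to put a number on that distinction (a sketch only - the deltas would come from something like the stats show samples above, and nothing here is a documented NetApp formula): divide physical aggregate transfers by the sum of logical volume ops over the same interval. A ratio far above 1 during the 5am window points at internal readers (scanners, readahead, parity/metadata) rather than client load.

#!/usr/bin/env python
# Sketch: compare physical (aggregate) transfers to logical (volume) ops
# over the same interval. Inputs are counter deltas sampled elsewhere;
# the numbers below are made up, purely for illustration.
def physical_per_logical(aggr_transfer_delta, volume_op_deltas, seconds):
    logical = sum(volume_op_deltas)
    print("aggregate: %.0f transfers/s" % (aggr_transfer_delta / float(seconds)))
    print("volumes:   %.0f ops/s" % (logical / float(seconds)))
    if logical:
        print("physical transfers per logical op: %.1f"
              % (aggr_transfer_delta / float(logical)))
    else:
        print("no logical volume ops at all - purely internal activity")

# Made-up numbers resembling the 5am pattern: lots of aggregate transfers,
# very few client ops on the three SATA volumes over a 300-second window.
physical_per_logical(1.2e6, [1500, 900, 600], 300)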
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters