toasters January 2013

toasters@lists.teaparty.net

45 participants
23 discussions

Aggregate Disk Busy 100% with volume IOPS low
by Fletcher Cocquyt 01 Apr '13


3270 cluster, OnTAP 8.1 7-mode

We are investigating a SATA aggregate showing repeated 5am disk 100% busy spikes without its volumes showing any corresponding IOPS spike, as reported by NetApp Management Console (NMC).

The 5am disk busy spikes correlate with very high latency on volumes on a different SAS aggregate. These volumes host VMs, which then time out, some needing reboots.

Today I heard back from NetApp support: after reviewing my perfstat, the engineer reported this is expected, since NVRAM buffers are shared between aggregates.

But when I dig further into the NMC stats, I see the SATA aggregate disk busy actually corresponds to a DROP in IOPS on the 3 volumes hosted on the SATA aggregate - almost as if some internal aggregate operations are starving out the external volume ops.

I checked the snapshots (vol and aggr), SnapMirror, and dedup, and none of the usual suspects were running.

When I look at the NMC throughput graphs and switch on the legend, it shows a 5am READ blocks/sec spike corresponding perfectly to the disk busy.

Where are these AGGR-level READ operations coming from that are missing from the constituent volume IOPS, and in fact seem to be starving out volume-level IO?

I don't see much in the messages log, but will check the rest of the logs for internal-type OPS.

thanks for any insight


wafl_cp_slovol_warning_1 with big latency spikes
by Fletcher Cocquyt 08 Mar '13


Yesterday morning one of the heads on our 3270 experienced large NFS latency spikes, causing our VMware hosts and their VMs to log storage timeouts. This latency does not correlate with any external metrics like CPU, network, OPS etc. But the logs do show CP events on the aggregate hosting the VMs:

Jan 14 05:27:56 [n04:wafl.cp.slovol:warning]: aggregate aggr2 is holding up the CP.

And the EMS log has CP events logged for the duration of the episode - what can we do to prevent these issues?

<wafl_cp_toolong_warning_1 total_ms="117825" total_dbufs="32276" clean="4312" v_ino="3" v_bm="29" a_ino="0" a_bm="3428" flush="1209"/>
</LR>
<LR d="14Jan2013 05:19:38" n="irt-na04" t="1358169578" id="1335304168/148007" p="4" s="Ok" o="wafl_CP_proc" vf="" type="0" seq="633232" >
<wafl_cp_slovol_warning_1 voltype="aggregate" volowner="" volname="aggr2" volident="" nt="35" nb="22045" clean="1346852" v_ino="0" v_bm="113" a_ino="0" a_bm="4" flush="0" rgid="2"/>

NetApp support wants me to run perfstats, but the issue is not ongoing - things are idle.

thanks

Fletcher Cocquyt
Principal Engineer
Information Resources and Technology (IRT)
Stanford University School of Medicine
Email: fcocquyt(a)stanford.edu
Phone: (650) 724-7485
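For anyone wanting to pull numbers out of EMS records like the two quoted in the message above, here is a minimal Python sketch. It only parses the self-closing event elements as they appear in the post; the helper names (`summarize`, `cp_seconds`) are made up for illustration, and reading `total_ms` as milliseconds is an assumption based on the attribute name, not on NetApp documentation.

```python
import xml.etree.ElementTree as ET

# Event payloads copied verbatim from the EMS excerpt in the message above.
EMS_EVENTS = [
    '<wafl_cp_toolong_warning_1 total_ms="117825" total_dbufs="32276" '
    'clean="4312" v_ino="3" v_bm="29" a_ino="0" a_bm="3428" flush="1209"/>',
    '<wafl_cp_slovol_warning_1 voltype="aggregate" volowner="" volname="aggr2" '
    'volident="" nt="35" nb="22045" clean="1346852" v_ino="0" v_bm="113" '
    'a_ino="0" a_bm="4" flush="0" rgid="2"/>',
]


def summarize(event_xml):
    """Return (event name, attribute dict) for one self-closing EMS element."""
    elem = ET.fromstring(event_xml)
    return elem.tag, dict(elem.attrib)


def cp_seconds(attrs):
    """Convert total_ms to seconds (ASSUMPTION: the unit is milliseconds)."""
    if "total_ms" not in attrs:
        return None
    return int(attrs["total_ms"]) / 1000


if __name__ == "__main__":
    for raw in EMS_EVENTS:
        name, attrs = summarize(raw)
        print(name, attrs.get("volname", "-"), cp_seconds(attrs))
```

If the millisecond assumption holds, the first event describes a consistency point that took roughly two minutes, which would line up with the length of the reported latency episode.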


Re: wafl_cp_slovol_warning_1 with big latency spikes
by Brian Beaulieu 04 Mar '13


I had opened a case when this happened and let the TSE know that I'm not the only one having this issue. I asked for it to be escalated and treated as a possible bug. We'll see where it goes.

Brian

On Thu, Jan 17, 2013 at 4:16 PM, Scott Eno <s.eno(a)icloud.com> wrote:
> We're on 8.1.1P1 across the environment.
>
> 7-mode, of course.
>
> On Jan 17, 2013, at 3:35 PM, Brian Beaulieu <brian.beaulieu(a)gmail.com> wrote:
>
> Scott, what version of OnTAP are you on?
> We're on 8.1.1P1
>
> I saw a BURT that sounded related to this but it was apparently fixed by 8.1.1P1.
>
> Brian
>
> On Thu, Jan 17, 2013 at 2:21 PM, Jeff Mohler <speedtoys.racing(a)gmail.com> wrote:
>
>> :s is wafl updating special files in the CP process.
>>
>> Going on _that_ long....??? A few seconds of special file updates in a CP sure, but that much?
>>
>> I'd be pretty pushy on getting an answer, I'd put that in the "it's a big bug" bucket. That's not normal IO activity in a healthy system.
>>
>> On Thu, Jan 17, 2013 at 8:07 PM, Scott Eno <s.eno(a)me.com> wrote:
>>
>>> Yes, this is what I see. ":s" and all the other protocols go to "0".
>>>
>>> There's been some correlation, when this happens, to cleanup of VMware snapshots (not NetApp snaps on the volumes, but VMware snapshots of VMs via vCenter). But it happens other times too.
>>>
>>> On Jan 17, 2013, at 1:47 PM, Brian Beaulieu <brian.beaulieu(a)gmail.com> wrote:
>>>
>>> 3rd time is the charm.
>>>
>>> I've attached my sysstat from the other night when NFS/CIFS hung up... is this what you've seen as well?
>>>
>>> During that issue, FCP was also slow.. had some MPIO failovers happening on our AIX LPARs. But AIX handles that just fine and at least has an alternate path through the other filer. NFS isn't so lucky.
>>>
>>> I have a 3250+1TB PAM sitting on deck.. you'd think that the 3240+512GB PAM would be sufficient for what we do.
>>> While I do have SATA in use for VMware, it's not heavy-hitting VMs.. it's the dormant stuff, mostly.
>>> I'm moving a lot of it, though, to 6xDS4243x600GB-15k shelves ASAP.
>>>
>>> I'm drinking the PAM kool-aid too but do have some measurable results, primarily on our PeopleSoft DB2 databases.
>>> I definitely wouldn't bet on SATA+PAM == FC/SAS performance.
>>>
>>> Brian
>>> <sysstat - Copy.txt>
>>> _______________________________________________
>>> Toasters mailing list
>>> Toasters(a)teaparty.net
>>> http://www.teaparty.net/mailman/listinfo/toasters
>>
>> --
>> ---
>> Gustatus Similis Pullus


6080 heads with 6040 NVRAM cards.
by Jeff Cleverley 12 Feb '13


Greetings,

I'm thinking about doing something that is not supported and was wondering if anyone had done the same or has more detailed insight.

We have a very busy cluster (6040s, 7.3.5.1P4). It looks like we are largely maxing out the heads for CPU. We are getting a pair of 6080s and really need to try to do the head swap live (takeover / giveback) if at all possible.

The unsupported part I want to do is keep the 6040 NVRAM cards and put them in the 6080s as I swap them. The reason for this is that I would not have to change the system ID ownership on all the drives. I know changing the system ID is generally not a big deal: boot each head to maintenance mode and reassign the old SID to the new SID. In our case it worries me.

Last week we were going to move a project to the other head by reassigning the appropriate drives for a couple of aggregates. While trying to reassign these, the SAS buses started panicking and crashed the controlling filer. The entire cluster was down. The ensuing mess took several hours to clean up. If it crashed while trying to change ownership of a few drives, I'm afraid of what will happen when it tries to reassign all the old SID drives for the new NVRAM card.

I was hoping that if we could keep the cards, we could swap heads, not change SIDs, and minimize our chance of repeating the crash. I could do the disks one at a time, but I have 796 drives on this cluster and would rather not.

Is there a requirement for the hardware to have the bigger memory cards? Since there are more CPUs, I can see where maybe something needs it, I just don't know what. We will probably have a downtime in a couple of months where I can put the correct ones back in.

Thanks,

Jeff

--
Jeff Cleverley
Unix Systems Administrator
4380 Ziegler Road
Fort Collins, Colorado 80525
970-288-4611


RE: nasty virus
by Klise, Steve 04 Feb '13


Too fast on the trigger..

I don't want to restart the thread about AV on or not on the filers (I run it on my filer), but we got hit with a nasty variant of this. Not sure why Trend didn't block it, but basically a bunch of my CIFS folders were set to read-only/hidden. Not fun. There are some other reasons that need to be mitigated that I won't go into, but this is the virus:

http://about-threats.trendmicro.com/Malware.aspx?id=47409&name=WORM_VOBFUS.…

To unhide the share, we mapped a drive to the share location and ran this. It unhides the shares lickety-split.

From an XP box, at a cmd prompt:

    for /f "usebackq delims=#" %i in (`dir *. /ah /s /b`) do attrib -s -h "%i"

Just sharing because I care..


Re: Aggregate Disk Busy 100% with volume IOPS low
by Fletcher Cocquyt 27 Jan '13


Indeed. We are considering replacing our premium support with next-day and using the savings to buy some professional services - we've heard other groups see better support ROI with this combination.

On the wafl block reclamation - are you talking about options wafl.trunc.throttle.vol.max etc.? We had to tune that back in 2010, in the 7.3.x days:
http://www.vmadmin.info/2010/11/vfiler-migrate-netapp-lockup.html

But I'm not sure whether this is still a hidden option in 8.1.x.

I read references to a tool called perfviewer - anyone still using that?

thanks

On Jan 26, 2013, at 1:45 PM, Isaiah <zoratu(a)gmail.com> wrote:

> If I were you, I would purchase some incident-based support from Berkeley Communications ("Berkcom"). They're the only reseller of both used and new NetApp gear in the world. They know more about NetApp than NetApp. I've been a customer for nine years and they're the first number I call, because they're all experts. No escalation yadda yadda. Completely worth the modest fees to support gear not purchased through them.
>
> The last time your situation happened to me I ended up having to tune the wafl block reclamation aggressiveness. There were so many snapshots happening on the system with lots of gradual changes that the disk utilization was high just reclaiming blocks that used to be dirty.
> --
> - Isaiah
>
> On Jan 26, 2013, at 10:15, Fletcher Cocquyt <fcocquyt(a)stanford.edu> wrote:
>
>> On Nick's advice I set up a job to log both wafltop and ps -c 1 once per minute - and we had a sustained sata0 disk busy from 5am-7am as reported by NMC.
>> The first question I have from wafltop show is: what is the first row (sata0::file i/o) reporting? What could be the source of these 28907 non-volume-specific Read IOs?
>>
>> Application           MB Total  MB Read(STD)  MB Write(STD)  Read IOs(STD)  Write IOs(STD)
>> -----------           --------  ------------  -------------  -------------  --------------
>> sata0::file i/o:          5860          5830             30          28907               0
>> sata0:backup:nfsv3:        608             0            608             31               0
>>
>> I'm just starting to go through the data.
>>
>> aggr status
>> Aggr   State   Status            Options
>> sata0  online  raid_dp, aggr     nosnap=on, raidsize=12
>>                64-bit
>> aggr2  online  raid_dp, aggr     nosnap=on, raidsize=19
>>                64-bit
>> aggr1  online  raid_dp, aggr     root, nosnap=on, raidsize=14
>>                32-bit
>>
>> na04*> df -Ah
>> Aggregate  total  used  avail   capacity
>> aggr1      13TB   11TB  1431GB  89%
>> aggr2      19TB   14TB  5305GB  74%
>> sata0      27TB   19TB  8027GB  72%
>>
>> <sataIOPSJan26.jpeg>
>>
>> thanks
>>
>> On Jan 25, 2013, at 5:33 PM, Nicholas Bernstein <nick(a)nicholasbernstein.com> wrote:
>>
>>> Try doing a 'ps -c 1' or a wafltop show (double check the syntax) while you're getting the spike; those will probably help you narrow down the processes that are using your disks. Both are priv set advanced/diag commands.
>>>
>>> Nick
>>
>> _______________________________________________
>> Toasters mailing list
>> Toasters(a)teaparty.net
>> http://www.teaparty.net/mailman/listinfo/toasters
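When wafltop output is being logged once per minute as described above, a small parser makes it easier to see which row dominates. The following Python sketch handles only the two rows quoted in the message; the column keys are illustrative names chosen here, the meaning of "(STD)" is not interpreted, and the real wafltop output may contain other line types this does not cover.

```python
# Rows copied from the wafltop excerpt quoted in the message above.
WAFLTOP_ROWS = """\
sata0::file i/o: 5860 5830 30 28907 0
sata0:backup:nfsv3: 608 0 608 31 0
"""

# Illustrative key names for the five numeric columns, in the order shown.
COLUMNS = ("mb_total", "mb_read", "mb_write", "read_ios", "write_ios")


def parse_rows(text):
    """Map each application name to a dict of its numeric columns."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Application names can contain spaces ("file i/o"), so peel the
        # five numeric fields off the right-hand side instead.
        parts = line.rsplit(None, len(COLUMNS))
        name = parts[0].rstrip(":")
        rows[name] = dict(zip(COLUMNS, map(int, parts[1:])))
    return rows


def read_share(rows, name):
    """Fraction of all logged read MB attributed to one application row."""
    total = sum(r["mb_read"] for r in rows.values())
    return rows[name]["mb_read"] / total if total else 0.0


if __name__ == "__main__":
    rows = parse_rows(WAFLTOP_ROWS)
    for app in rows:
        print(app, rows[app]["read_ios"], round(read_share(rows, app), 3))
```

On the two quoted rows, all of the read MB falls under `sata0::file i/o`, which matches the observation that the reads are not attributed to any volume.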


Re: Aggregate Disk Busy 100% with volume IOPS low
by Fletcher Cocquyt 24 Jan '13


Hi Mike,

This is a SnapMirror SOURCE. The perfstats captured during the issue include wafl scan status - I'll mention it to the engineer when I show the disparity between AGGR OPS and total volume OPS.

Unfortunately it's not easy to differentiate the source of IOPS with any existing tools in my experience - including perfstats and escalated NetApp support analysis. But the NMC AGGR vs volume stats need to be explained somehow.

thanks

On Jan 23, 2013, at 9:35 PM, Mike Nye <Mike.Nye(a)xpanse.com.au> wrote:

> Hi Fletcher,
>
> Is this FAS a SnapMirror destination? This sounds very much like a volume deswizzle / CBR scan issue…
>
> You could try scheduling a “priv set advanced; wafl scan status” to run via RSH/SSH during the time of poor performance, to see what internal WAFL scans are taking place on the system.
>
> Things to look for are “Container Block Reclamation” and/or “Volume Deswizzle”.
>
> Kind regards,
> Mike Nye
>
> Mike Nye
> Team Lead Systems Engineer
> mike.nye(a)xpanse.com.au
> www.xpanse.com.au
> Mob: +61 407 772 465
> Office: +61 8 9322 6767
> Fax: +61 8 9322 6077
> 18 Emerald Tce
> West Perth
> WA 6005
>
> From: toasters-bounces(a)teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Fletcher Cocquyt
> Sent: Thursday, 24 January 2013 1:28 PM
> To: toasters(a)teaparty.net Lists
> Subject: Aggregate Disk Busy 100% with volume IOPS low
>
> 3270 cluster, OnTAP 8.1-7mode
>
> We are investigating a SATA aggregate showing repeated 5am disk 100% busy spikes without its volumes showing any corresponding IOPS spike as reported by Netapp Management Console (NMC).
> The 5am disk busy spikes correlate with very high latency on volumes on a different SAS aggregate. These volumes host VMs which then timeout, some needing reboots.
> Today when I heard from Netapp support after reviewing my perfstat the engineer reported this is expected since NVRAM buffers are shared btw aggregates.
>
> But when I dig further into the NMC stats I see the SATA aggregate disk busy actually corresponds to a DROP in IOPS on the 3 volumes hosted on the SATA aggregate - almost like some internal aggregate operations are starving out the external volume ops.
> I checked the snapshots (vol and aggr), snap mirror, dedup and none of the usual suspects were running.
>
> When I look at the NMC throughput graphs and switch on the legend - it shows a 5am READ blocks/sec spike corresponding perfectly to the disk busy.
>
> Where are these AGGR level READ operations coming from that are missing from the constituent volume IOPS, and in fact seem to be starving out volume level IO?
>
> I don't see much in the messages log, but will check the rest of the logs for internal type OPS
>
> thanks for any insight
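Mike's suggestion of running the scan-status command over SSH during the bad window could be scripted roughly as follows. This is a sketch under several assumptions: the host name is a placeholder, key-based SSH access to the filer is already configured, and the filer accepts the command string exactly as quoted in the message. The function names are hypothetical, and nothing here has been checked against a real controller.

```python
import datetime
import subprocess

FILER = "filer.example.com"  # hypothetical host name, replace with your own
COMMAND = "priv set advanced; wafl scan status"  # string as quoted in the thread


def build_ssh_argv(host, command):
    """Build the ssh argument list; BatchMode avoids hanging on a password prompt."""
    return ["ssh", "-o", "BatchMode=yes", host, command]


def capture(host=FILER, command=COMMAND, runner=subprocess.run):
    """Run the command once and return its output under a timestamped header."""
    result = runner(
        build_ssh_argv(host, command),
        capture_output=True,
        text=True,
        timeout=60,
    )
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    return f"=== {stamp} {host} ===\n{result.stdout}"


if __name__ == "__main__":
    # Append one sample per invocation; the log path is an arbitrary choice.
    with open("wafl_scan_status.log", "a") as log:
        log.write(capture())
```

Run from cron once a minute between, say, 04:45 and 07:15, this would produce a log that can be searched afterwards for the scan types Mike mentions.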


Re: Aggregate Disk Busy 100% with volume IOPS low
by Fletcher Cocquyt 24 Jan '13


No NDMP in use for us - at a loss to explain this level of AGGR disk busy with no vol-level IO.
Feels like an internal-type operation hitting a bug.

thanks

On Jan 23, 2013, at 10:21 PM, "Klise, Steve" <klises(a)sutterhealth.org> wrote:

> A stab but what about ndmp jobs?
>
> From: Fletcher Cocquyt [mailto:fcocquyt@stanford.edu]
> Sent: Wednesday, January 23, 2013 09:27 PM
> To: toasters(a)teaparty.net Lists <toasters(a)teaparty.net>
> Subject: Aggregate Disk Busy 100% with volume IOPS low
>
> 3270 cluster, OnTAP 8.1-7mode
>
> We are investigating a SATA aggregate showing repeated 5am disk 100% busy spikes without its volumes showing any corresponding IOPS spike as reported by Netapp Management Console (NMC).
> The 5am disk busy spikes correlate with very high latency on volumes on a different SAS aggregate. These volumes host VMs which then timeout, some needing reboots.
> Today when I heard from Netapp support after reviewing my perfstat the engineer reported this is expected since NVRAM buffers are shared btw aggregates.
>
> But when I dig further into the NMC stats I see the SATA aggregate disk busy actually corresponds to a DROP in IOPS on the 3 volumes hosted on the SATA aggregate - almost like some internal aggregate operations are starving out the external volume ops.
> I checked the snapshots (vol and aggr), snap mirror, dedup and none of the usual suspects were running.
>
> When I look at the NMC throughput graphs and switch on the legend - it shows a 5am READ blocks/sec spike corresponding perfectly to the disk busy.
>
> Where are these AGGR level READ operations coming from that are missing from the constituent volume IOPS, and in fact seem to be starving out volume level IO?
>
> I don't see much in the messages log, but will check the rest of the logs for internal type OPS
>
> thanks for any insight


Vfiler DR activate
by Iluhes 22 Jan '13


Hello Toasters,

After running VFILER DR Activate and failing over a vFiler to the DR site, none of the shares were available. The domain controller was not available in the DR bubble, but we thought that local admin users on the vFiler should still have access to the shares. We could not access them.

For certain vFiler DRs where the production side is a domain member, we need the filer to be available in a DR test situation in which the domain controller is not available. We have found in testing that in the absence of the domain controller, all share access is denied. Even when we try local accounts, which on a Windows file server would have let us access the share without a domain controller, we are denied access. Is that correct behavior?

We would get: "There are currently no logon server available to service the logon request"

Is there any way to do vFiler DR, where the vFiler is a domain member, so that one can access the shares in DR without the domain being present, using local accounts only?

Thanks!!!


RE: Vfiler DR activate
by steve klise 22 Jan '13


Was there a delta in time between the filer and the client? For Kerberos, clocks usually have to be within 5 minutes of AD on both source and destination. You would want an AD controller to be available, and usually as the 1st box up, for CIFS authentication...

Sent from Windows Mail

From: Iluhes
Sent: January 21, 2013 6:06 PM
To: toasters(a)teaparty.net
Subject: Vfiler DR activate

Hello Toasters,

After running VFILER DR Activate and failing over vFiler to DR site, none of the shares were available. The DC controller was not available in DR bubble, but we thought that local admin users on vfiler should have access to shares. We could not access shares.

For certain vFiler DRs when the production side is a domain member, we have a need to have the filer available in a DR test situation in which the domain controller is not available. We have found in testing that in the absence of the domain controller, all share access is denied. Even when we try accessing with local accounts, which on a Windows file server, would have let us access the share in the absence of a domain controller we are denied access. Is that correct behavior?

We would get: "There are currently no logon server available to service the logon request"

Is there anyway to do vFiler DR where the vFiler is a domain member where one can access the shares in DR without the domain being present--using local accounts only.

Thanks!!!
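The 5-minute rule mentioned in the reply can be expressed as a simple comparison. The Python sketch below assumes the default Kerberos tolerance of 5 minutes (sites can configure a different value) and takes both timestamps as inputs; how you obtain the filer's and the client's clocks is left out, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# ASSUMPTION: default Kerberos clock-skew tolerance; your domain policy may differ.
MAX_SKEW = timedelta(minutes=5)


def within_kerberos_skew(client_time, server_time, max_skew=MAX_SKEW):
    """True if the two clocks differ by no more than the allowed skew."""
    return abs(client_time - server_time) <= max_skew


if __name__ == "__main__":
    filer = datetime(2013, 1, 21, 18, 6, 0, tzinfo=timezone.utc)
    client = filer + timedelta(minutes=7)  # made-up example drift
    print(within_kerberos_skew(client, filer))
```

Using timezone-aware timestamps on both sides avoids a false alarm when the two systems merely sit in different time zones.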

