toasters February 2012

toasters@lists.teaparty.net

52 participants
39 discussions

deduplication, snapshots and performance questions
by Randy Rue 16 Feb '12

16 Feb '12

Hello All, We're troubleshooting some performance issues and trying to determine how deduplication might be might be muddying the waters on an already busy v3170 HA filer pair. There's also some likelihood that snapshots are involved. In a nutshell we have two Vfilers running on one node. One serves NFS datastores to our VMWare farm and all its volumes are de-duped. We recently migrated the other to the node, it's serving NFS to a wide variety of hosts and IO tends to be very heavy on metadata (get_attr and set_attr). We've found a pretty clear correlation between running de-dupes and performance complaints from customers of the other Vfiler. Whether we're running one de-dupe session or eight, CPU3 stays in the mid-high nineties. This is consistent with what NetApp has told us: that de-dupe processes live in the "Kahuna" domain that always ends up on one proc (in our case, the last of the four). They've also told us that de-dupe processes are heavily "niced" and shouldn't impact other processes, but after just having finished reading their 75 page PDF TR-3505 ("NetApp Deduplication for FAS and V-Series Deployment and Implementation Guide"), it's pretty clear the answer is a huge "it depends" and that de-dupe processes have a very real impact on performance. Especially if we're talking about a system that's otherwise heavily used, that is a higher model number, and that is using ATA drives (bingo on all three). Does anyone have any real-world experience with de-dupe processes impacting performance? Any suggestions for tuning? We've spent some time moving the schedules around but kept finding that efforts to spread them out were pretty fruitless: some volumes took much longer than others and eventually they started overlapping and stacking up. Right now we're running a script from another server that checks to see how many de-dupes are running and launches new ones as old ones complete, depending on what time it is. The end result is we run a constant loop over a list of fourteen VMWare datastore volumes with two de-dupes running on nights and weekends and none during the day. This means it takes about three days for that list to repeat; we used to run each volume every other day so we're watching the size numbers to make sure we don't hit any walls on the aggregates. Here's a tangent question. Does anyone know if, everything else being equal, two equal de-dupe jobs will complete faster if they're run sequentially or in parallel? If this process really is confined to a single proc and the work to be done is directly related to the ops available the blocks to be changed, it seems like should be a wash, but we haven't collected enough data to figure this out for ourselves (I'm afraid if we spend too much time running at only one process we'll have space problems). To add snapshots to the mix: we're running hourly snapshots on most or all of the volumes on both Vfilers. NetApp has told us this adds considerable performance load, most notably the work involved in releasing a snapshot as the oldest one rolls off the stack. They've said they can see instances in the logs where a snapshot begins before the previous one is finished. We're considering splitting the volumes into two groups and snapping them at alternate hours, so we get protection at two-hour granularity but each hour we're only processing half the volumes. However. Won't we then be processing twice as many changed blocks for each snapshot? Do we gain anything? We had this exact problem, for example, with a different storage vendor with a similar snapshot implementation. In that case their solution was actually to step up to snapshots every half hour. It ended up that processing smaller hunks of changed blocks more often solved our problem. Right now we're leaning toward trying a hybrid of these two approaches on our NetApp, splitting the volumes into two groups and running them at the top and bottom of the hour. We still get hourly snapshots on all volumes but only need to parse half as many in each pass. Anyway, I've rambled a lot and asked not many specific questions. What I'm hoping to hear is about: * Whether de-dupe processes scale linearly (do two jobs in parallel take the same total time as sequentially) * Anyone's experience tuning de-dupe processes and/or snapshots to minimize performance impact. Hope to hear from you, Randy

5 4

QFULL / vFiler
by Stuart Kendrick 15 Feb '12

15 Feb '12

How do I look for QFULLs on a vFiler? FAS3020 lun-stats -o -a /vol/vmware/vmware_db.lun (67 days, 5 hours, 14 minutes, 4 seconds) Read (kbytes) Write (kbytes) Read Ops Write Ops Other Ops QFulls Partner Ops Partner KBytes 0 0 0 0 0 0 0 0 V3170 lun-stats -o -a [empty output] --sk

1 0

Flash Cache questions: symmetrical cache mode required? Metadata mode only with a 512GB Flash Cache?
by Robert McDermott 11 Feb '12

11 Feb '12

Hello, We a V3170 cluster with a 512GB Flash Cache module installed in each controller. Each module is currently configured to cache normal data (default setting): flexscale.enable on flexscale.lopri_blocks off flexscale.normal_data_blocks on We have a vfiler running on one controller that does very little normal IO but has a very heavy metadata load due to poor application design. This vfiler has poor performance and its function is critical. We are thinking about switching to metadata only caching mode (flexscale.lopri_blocks off, flexscale.normal_data_blocks off) to improve its performance but have a couple of questions: The Flash Cache best practices guide has the following verbage about enabling metadata only mode: "Because of the much larger size of Flash Cache, this mode is more applicable to PAM I, the original 16GB DRAM–based Performance Acceleration Module, than Flash Cache." Does that mean that this setting doesn't apply (not recommended/supported) for Flash Cache? but is for PAM I? Is using metadata only mode a bad idea with a large flash cache module? If so why? The best practices guide also indicates that it's recommended to have a symmetrical number and size of modules between controllers in a cluster, but it doesn't say anything about symmetrical cache mode settings. Is it OK to have one controller's flash cache set to the normal data setting, but the others flash cache set to metadata only? During a failover the cache of the failed controller is essentially lost (doesn't follow it to the remaining controller) so it doesn't seem like it would matter as long as the cluster didn't stay in this failover state for a long period of time. What are your thoughts on this? Thanks in advance, -Robert

9 8

tracking aggr initialization progress ?
by Fletcher Cocquyt 10 Feb '12

10 Feb '12

Is there a way to track the progress (% complete?) for aggr creation/initialization? aggr status Aggr State Status Options aggr1 creating raid_dp, aggr raidsize=20, initializing snapshot_autodelete=off, 32-bit lost_write_protect=off thanks Fletcher

4 3

Re: The following prevents offline of volume: zapi snapshot list
by Fletcher Cocquyt 10 Feb '12

10 Feb '12

The snapshot process must have completed - I was able to offline by waiting a few minutes thanks Fletcher On Feb 10, 2012, at 2:31 PM, Parisi, Justin wrote: > Are you able to reboot the box? Because those will need to complete otherwise before it allows a vol offline. > > If you try to force the vol offline, it will hang. Reboot will clear this up and you should be able to offline the vols. > > From: toasters-bounces(a)teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Fletcher Cocquyt > Sent: Friday, February 10, 2012 4:57 PM > To: toasters(a)teaparty.net > Subject: The following prevents offline of volume: zapi snapshot list > > Hi > I'm attempting to offline all the volumes on this 64bit aggr so I can destroy and re-create as 32 bit (so 32->32 bit data motion is supported) > > I am encountering this error: > > The following prevents offline of volume vm2 > zapi snapshot list > zapi snapshot list > zapi snapshot list > > How do I clear it so I can proceed? > > thanks > > Fletcher > > > > > >

1 0

The following prevents offline of volume: zapi snapshot list
by Fletcher Cocquyt 10 Feb '12

10 Feb '12

Hi I'm attempting to offline all the volumes on this 64bit aggr so I can destroy and re-create as 32 bit (so 32->32 bit data motion is supported) I am encountering this error: The following prevents offline of volume vm2 zapi snapshot list zapi snapshot list zapi snapshot list How do I clear it so I can proceed? thanks Fletcher

1 0

dedup changed in 8.1?
by Fletcher Cocquyt 10 Feb '12

10 Feb '12

Last night (first since the upgrade from 7.3.5.1P2 to 8.1RC2) we experienced much higher load on our 3270 cluster and some VM timeouts coinciding with our dedup schedule (11pm -2am) Anyone heard if dedup changed in 8.1? I am disabling the regular dedup schedule until this is sorted out thanks Fletcher

5 5

nfsstats help
by Jeremy Page 09 Feb '12

09 Feb '12

How do I find out what's causing these messages and what the severity is? With large chunks taken out: Server rpc: TCP: calls badcalls nullrecv badlen xdrcall 72337 4294847830 0 0 4294847830 No of RAID errors propagated by WAFL = 2184604 It certainly looks bad but I am not sure if this is a big deal or not. I've been seeing some ugly latencies on my NFS traffic over the past day or so and I'm trying to determine what it could be. They are running in the 15-20 ms range (from stats show -i5 nfsv3) where normally it's under 5ms. Please be advised that this email may contain confidential information. If you are not the intended recipient, please notify us by email by replying to the sender and delete this message. The sender disclaims that the content of this email constitutes an offer to enter into, or the acceptance of, any agreement; provided that the foregoing does not invalidate the binding effect of any digital or other electronic reproduction of a manual signature that is included in any attachment.

1 0

Force 64bit aggr migration
by Fletcher Cocquyt 09 Feb '12

09 Feb '12

Anyone know if its possible to force the in place 32->64bit aggr non-disruptive migration without having to add disk? I'm running into a 8.1R2 data motion issue where NMC is refusing to migrate vFilers from the 32 bit source cluster to the 64bit destination. The destination aggrs are at or above 16Tb so they need to be 64 bit thanks Fletcher

4 4

Fas 3240 failover delay
by John Sfetsoris 09 Feb '12

09 Feb '12

Hello everybody I have a Fas3204 HA in two seperate chassis , dataontap version 8.02P3 7mode. Many NFS servers are connected to it via iscsi connections. NFS servers are VMware mashines . VMware version is VSPERE 5. There are 4-5 Vfilers running. When i force a failover from the one controler to other, the NFS clients have a delay about 30-40 seconds. Is this normal? Can i do something to investigate/tune the failover proccess? thnks in advance John

5 6

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

toasters February 2012