Hello All,
We're troubleshooting some performance issues and trying to determine how deduplication might be muddying the waters on an already busy v3170 HA filer pair. There's also some likelihood that snapshots are involved.
In a nutshell, we have two Vfilers running on one node. One serves NFS datastores to our VMWare farm, and all of its volumes are de-duped. We recently migrated the other to the node; it serves NFS to a wide variety of hosts, and its IO tends to be very heavy on metadata (get_attr and set_attr).
We've found a pretty clear correlation between running de-dupes and performance complaints from customers of the other Vfiler. Whether we're running one de-dupe session or eight, CPU3 stays in the mid-to-high nineties. This is consistent with what NetApp has told us: that de-dupe processes live in the "Kahuna" domain, which always ends up on one proc (in our case, the last of the four). They've also told us that de-dupe processes are heavily "niced" and shouldn't impact other processes, but after just having finished reading their 75-page TR-3505 PDF ("NetApp Deduplication for FAS and V-Series Deployment and Implementation Guide"), it's pretty clear the answer is a huge "it depends" and that de-dupe processes have a very real impact on performance, especially on a system that's otherwise heavily used, is a higher model number, and is using ATA drives (bingo on all three).
Does anyone have any real-world experience with de-dupe processes impacting performance? Any suggestions for tuning? We've spent some time moving the schedules around but kept finding that efforts to spread them out were pretty fruitless: some volumes took much longer than others, and eventually they started overlapping and stacking up. Right now we're running a script from another server that checks how many de-dupes are running and launches new ones as old ones complete, depending on the time of day. The end result is a constant loop over a list of fourteen VMWare datastore volumes, with two de-dupes running on nights and weekends and none during the day. This means it takes about three days for the list to repeat; we used to run each volume every other day, so we're watching the size numbers to make sure we don't hit any walls on the aggregates.
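For what it's worth, the scheduling logic in that script boils down to something like the sketch below (a simplified Python rendition, not our actual script; the volume names and concurrency caps are placeholders, and the actual job launching is left out):

```python
from datetime import datetime

VOLUMES = ["vol_ds01", "vol_ds02", "vol_ds03"]  # placeholder names; our real list has fourteen
MAX_NIGHT_WEEKEND = 2   # concurrent de-dupe jobs allowed off-hours
MAX_BUSINESS_HOURS = 0  # none during the day

def allowed_jobs(now: datetime) -> int:
    """How many de-dupe jobs may run at this moment."""
    weekend = now.weekday() >= 5           # Saturday or Sunday
    business = 8 <= now.hour < 18          # rough daytime window (an assumption)
    if weekend or not business:
        return MAX_NIGHT_WEEKEND
    return MAX_BUSINESS_HOURS

def next_launches(running: list, queue: list, now: datetime) -> list:
    """Pick queued volumes to start so the running count reaches the allowed cap."""
    slots = max(0, allowed_jobs(now) - len(running))
    return [v for v in queue if v not in running][:slots]
```

The real thing would poll the filer to build the `running` list, kick off a de-dupe for each volume returned, and rotate the queue so every volume eventually gets its turn.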
Here's a tangential question. Does anyone know whether, everything else being equal, two equal de-dupe jobs will complete faster run sequentially or in parallel? If this process really is confined to a single proc, and the work to be done is directly related to the ops available and the blocks to be changed, it seems like it should be a wash, but we haven't collected enough data to figure this out for ourselves (I'm afraid that if we spend too much time running only one process at a time, we'll have space problems).
To add snapshots to the mix: we're running hourly snapshots on most or all of the volumes on both Vfilers. NetApp has told us this adds considerable performance load, most notably the work involved in releasing a snapshot as the oldest one rolls off the stack. They've said they can see instances in the logs where a snapshot begins before the previous one is finished. We're considering splitting the volumes into two groups and snapping them at alternate hours, so we get protection at two-hour granularity but each hour we're only processing half the volumes.
However: won't we then be processing twice as many changed blocks for each snapshot? Do we gain anything? We hit this exact problem with a different storage vendor with a similar snapshot implementation. In that case, their solution was actually to step up to snapshots every half hour; processing smaller chunks of changed blocks more often solved our problem.
Right now we're leaning toward trying a hybrid of these two approaches on our NetApp, splitting the volumes into two groups and running them at the top and bottom of the hour. We still get hourly snapshots on all volumes but only need to parse half as many in each pass.
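The grouping logic we have in mind is trivial to express; in Python terms (hypothetical volume names, snapshot command itself left out), it's roughly:

```python
def group_due(volumes, minute):
    """Return the volumes due for a snapshot at this minute of the hour:
    the first half of the list snaps at the top of the hour, the second
    half at the bottom, so every volume still gets hourly snapshots."""
    half = (len(volumes) + 1) // 2
    return volumes[:half] if minute < 30 else volumes[half:]
```

A cron job at :00 and :30 would call this and snap only the returned group, halving the number of volumes processed in each pass.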
Anyway, I've rambled a lot and asked not many specific questions. What I'm hoping to hear about is:
* Whether de-dupe processes scale linearly (do two jobs in parallel take the same total time as two run sequentially?)
* Anyone's experience tuning de-dupe processes and/or snapshots to minimize performance impact.
Hope to hear from you,
Randy
We ran into similar performance problems using dedupe on a 3040 cluster. We don't use vFilers. For us the performance problems really hit our FC LUNs for VMWare. We didn't get too many complaints about the performance on the other volumes on the system (our DB/2 and Oracle volumes, CIFS and NFS shares).
Like you, we saw the high Kahuna domain utilization.
I wish I could suggest solutions or things to tune. But after months of getting nowhere, we switched to another vendor for our VMWare storage.
----- Original Message -----
From: "Randy Rue" rrue@fhcrc.org
To: toasters@teaparty.net
Sent: Wednesday, February 15, 2012 4:13:48 PM
Subject: deduplication, snapshots and performance questions
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
Have you looked at VMDK alignment? On the NFS datastores, we saw performance issues related to snapshot deletion during high utilization.
Make sure the dedupe savings are worth the 'effort'. We had issues with deduping large volumes but only saw 2-3% savings, so we ended up stopping dedupe on those volumes.
Jack
Sent from my Verizon Wireless BlackBerry
-----Original Message-----
From: Randy Rue rrue@fhcrc.org
Sender: toasters-bounces@teaparty.net
Date: Wed, 15 Feb 2012 16:13:48
To: toasters@teaparty.net
Subject: deduplication, snapshots and performance questions
On Wed, Feb 15, 2012 at 4:13 PM, Randy Rue rrue@fhcrc.org wrote:
Hello All,
We're troubleshooting some performance issues and trying to determine how deduplication might be muddying the waters on an already busy v3170 HA filer pair. There's also some likelihood that snapshots are involved.
Why do you need to run the de-dup process daily? In our environment, we set the schedule to auto and let the filer manage it based on the change threshold, and we still see very good de-dupe ratios and good performance. e.g.:

Filesystem        used     saved   %saved
/vol/vmstore1/   446GB    301GB      40%
/vol/vmstore2/   607GB    281GB      32%
/vol/vmstore3/   693GB    245GB      26%
/vol/vmstore4/   494GB    197GB      29%
/vol/vmswap/      81GB     95GB      54%
/vol/vmstore5/  1378GB    461GB      25%
/vol/vmstore6/    70GB     57GB      45%
/vol/vmstore7/  4598MB    624MB      12%
/vol/vmstore8/   329GB    109GB      25%
/vol/vmstore9/   446GB     90GB      17%
sis status
/vol/vmstore1   Enabled   Idle   Idle for 43:34:28
/vol/vmstore2   Enabled   Idle   Idle for 10875:57:18
/vol/vmstore3   Enabled   Idle   Idle for 41:50:19
/vol/vmstore4   Enabled   Idle   Idle for 163:47:09
/vol/vmstore5   Enabled   Idle   Idle for 25:51:08
/vol/vmstore6   Enabled   Idle   Idle for 32:22:59
/vol/vmstore7   Enabled   Idle   Idle for 27:20:30
/vol/vmstore8   Enabled   Idle   Idle for 56:33:27
/vol/vmstore9   Enabled   Idle   Idle for 50:51:38
/vol/vmswap     Enabled   Idle   Idle for 39:46:32
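In case it saves anyone a manual lookup: the auto schedule is set per volume with `sis config`. A sketch of the 7-mode commands (volume name hypothetical; check your ONTAP version's syntax before relying on this):

```shell
# Enable dedupe on the volume (if not already enabled) and switch it
# from a fixed daily schedule to the change-log-driven "auto" schedule.
sis on /vol/vmstore1
sis config -s auto /vol/vmstore1

# Verify the schedule and current state.
sis config /vol/vmstore1
sis status /vol/vmstore1
```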
Hope that helps
-net
Just to add to the fodder..
Have you looked at Performance Advisor? I use it at a high level to see what's really busy.
Also, if you've recently added spindles to an existing aggr, you may need to run a reallocate.
Maybe you don't have enough spindles for the workloads. Little value add here, but maybe you need (if you don't already have it) a PAM card of the appropriate size.
Just thought I would throw that in.
Date: Wed, 15 Feb 2012 19:05:24 -0800
Subject: Re: deduplication, snapshots and performance questions
From: netbacker@gmail.com
To: rrue@fhcrc.org
CC: toasters@teaparty.net