Yes, a colleague started a large snaprestore (1 TB on a SATA aggr) and it ended up coinciding with the full backups late on the weekend. The datastore became unavailable via NFS - the 3rd shift support engineer had me on the line waiting for an hour before I suggested we just reboot. It was another hour before I insisted we just reboot the head and service was restored on NFS - then I recovered several VMs.
I never use snaprestore personally; it is very slow. I recommend a 10GbE-attached rsync host to recover directly from the .snapshot dir: rsync provides throughput and progress stats and can be restarted if interrupted.
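A minimal sketch of that rsync approach, assuming the volume is NFS-mounted on the recovery host. The mount point, snapshot name (nightly.0), and file paths are hypothetical placeholders, not values from this incident. It only builds and prints the command so it can be reviewed before anything is copied.

```python
import shlex
from pathlib import PurePosixPath


def build_rsync_cmd(mount, snapshot, rel_path, dest):
    """Build an rsync argv that copies rel_path out of a .snapshot directory.

    mount, snapshot, rel_path, and dest are all supplied by the caller;
    nothing here is specific to a particular filer or ONTAP release.
    """
    src = PurePosixPath(mount) / ".snapshot" / snapshot / rel_path
    return [
        "rsync",
        "-a",          # archive mode: keep permissions, ownership, timestamps
        "--partial",   # keep partially transferred files if interrupted
        "--progress",  # per-file progress and transfer rate
        str(src),
        str(dest),
    ]


if __name__ == "__main__":
    # Hypothetical example values; substitute your own mount and snapshot.
    cmd = build_rsync_cmd(
        "/mnt/datastore1",
        "nightly.0",
        "vm01/vm01-flat.vmdk",
        "/mnt/datastore1/vm01/",
    )
    print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True). If the copy is interrupted, rerunning the same command skips files that already match in size and modification time.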
This is likely a snaprestore/NFS related bug in ONTAP - please let me know if you get any RCA from your perfstats!
Cheers, Fletcher.
On Sep 17, 2014, at 6:30 PM, Ray Van Dolson rvandolson@esri.com wrote:
That's something we're definitely keeping in mind as we put together our own internal RCA. This particular box *was* quite busy with the SATA disks in question at times oversaturated. Perhaps our snaprestore issue would not have reared its head absent some of that oversaturation? It certainly could have contributed to creating conditions where snaprestore could cause the side effects we observed.
With that said, it did not appear that snaprestore running was introducing new "load" -- at least from a metrics standpoint. OnCommand graphs didn't show anything different than what I'd quantify as typical load. We couldn't even tell visually where snaprestore kicked in from the graphs... based on this we initially discounted that snaprestore could be causing the problems...
Fletcher, did your issue occur on a potentially oversaturated environment?
Thanks for all the replies.
Ray
On Thu, Sep 18, 2014 at 01:24:26AM +0000, Parisi, Justin wrote:
If you are rebooting the controller, you might as well core the box. That may help in analysis of the issue.
Keep in mind that if you're hammering disks in a system with something external (like NDMP) you can affect other protocols, such as CIFS and NFS. The system has limited resources available to it, and pegging out disks, CPU, RAM, etc. can impact everyone. Perfstat would be able to verify if you're pegging resources. If it's not a resource issue with hardware and is a software bug, a core file would help verify that.
On 9/17/14, 8:28 PM, "Ray Van Dolson" rvandolson@esri.com wrote:
Hmm. And you're on a version fairly close to ours. For us, NFS service actually recovered on its own -- after 30 minutes or so of "impact". Then it would be stable for a while and the issue would return. Rinse & repeat. Rebooting the controller did expedite recovery (though didn't prevent recurrence).
We don't have a bug #, but did manage to capture a perfstat during one of the outages. We'll keep pushing on this...
Ray
On Wed, Sep 17, 2014 at 05:20:53PM -0700, Fletcher Cocquyt wrote:
We experienced the same NFS outage on a 2240 SATA aggr running 8.1.2. We ended up having to reboot the filer to recover NFS service. Is there a bug number for this issue? We opened a case but were told without a perfstat from the incident there was not much diagnostic info to go on.
thanks
On Sep 17, 2014, at 4:48 PM, Ray Van Dolson rvandolson@esri.com wrote:
I'll add that this issue seems very similar:
https://communities.netapp.com/thread/12180
Though on a much older version of ONTAP (well, presumably -- the OP doesn't exactly state what they're running, but it is from 2010).
Ray
On Wed, Sep 17, 2014 at 04:04:23PM -0700, Ray Van Dolson wrote:
Thanks for the reply. ndmpcopy is probably faster, though we've used single-file snaprestore in the past with no issues (but hadn't used it since upgrading to 8.1.2P4).
It's interesting to me that no other functionality on the filer (at least as far as we're aware) was impacted other than NFS.
We'll work with IBM to see if this is a known issue or something new. Support tells us the behavior we observed is absolutely not expected.
Ray
> On Wed, Sep 17, 2014 at 08:50:44PM +0000, Jordan Slingerland wrote:
> I have heard of some issues with single-file snaprestore in 'older' versions...maybe fixed in 8.2? I am not sure. I always use ndmpcopy over snaprestore when possible. I would suggest that as an alternative, though I know that does not exactly answer your question.
>
> --Jordan
>
> -----Original Message-----
> From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Ray Van Dolson
> Sent: Wednesday, September 17, 2014 4:35 PM
> To: toasters@teaparty.net
> Subject: Single-file Snaprestore Causing Performance Impact?
>
> Hi all;
>
> Running 8.1.2P4 in 7-Mode on an IBM N6240. We initiated a couple of single-file snaprestores which ran for 15+ hours on some busy SATA-based aggregates. During that time, we experienced intermittent issues connecting to the NFS services on this filer. Issues would clear up after a while (minutes or tens of minutes) and then return an hour or so later.
>
> We killed the snaprestores during one of the outages and observed a full recovery of the NFS service. It may have been coincidental.
>
> Anyone aware of snaprestore (specifically, single-file restores) causing cascading impacts?
>
> OnCommand doesn't show any additional spike in CPU, disk activity, etc....
>
> Thanks,
> Ray