Single-file Snaprestore Causing Performance Impact?


Ray Van Dolson

17 Sep 2014

8:34 p.m.

Hi all;

Running 8.1.2P4 in 7-Mode on an IBM N6240. We initiated a couple of single-file snaprestores, which ran for 15+ hours on some busy SATA-based aggregates. During that time, we experienced intermittent issues connecting to the NFS services on this filer. Issues would clear up after a while (minutes or tens of minutes) and then return an hour or so later.

We killed the snaprestores during one of the outages and observed a full recovery of the NFS service. It may have been coincidental.

Anyone aware of snaprestore (specifically, single-file restores) causing cascading impacts?

OnCommand doesn't show any additional spike in CPU, disk activity, etc....

Thanks, Ray


Jordan Slingerland

17 Sep 2014

8:50 p.m.

I have heard of some issues with single-file snaprestore in 'older' versions, maybe fixed in 8.2? I am not sure. I always use ndmpcopy over snaprestore when possible. I would suggest that as an alternative, though I know that does not exactly answer your question.

--Jordan

Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters

Ray Van Dolson

11:04 p.m.

Thanks for the reply. ndmpcopy is probably faster, though we've used single-file snaprestore in the past with no issues (but hadn't used it since upgrading to 8.1.2P4).

It's interesting to me that no other functionality on the filer (at least as far as we're aware) was impacted other than NFS.

We'll work with IBM to see if this is a known issue or something new. Support tells us the behavior we observed is absolutely not expected.

Ray


Ray Van Dolson

11:45 p.m.

I'll add that this issue seems very similar:

https://communities.netapp.com/thread/12180

Though on a much older version of ONTAP (well, presumably -- the OP doesn't exactly state what they're running, but it is from 2010).

Ray


Fletcher Cocquyt

18 Sep 2014

12:20 a.m.

We experienced the same NFS outage on a 2240 SATA aggr running 8.1.2. We ended up having to reboot the filer to recover NFS service. Is there a bug number for this issue? We opened a case but were told without a perfstat from the incident there was not much diagnostic info to go on.

thanks


Ray Van Dolson

12:28 a.m.

Hmm. And you're on a version fairly close to ours. For us, NFS service actually recovered on its own -- after 30 minutes or so of "impact". Then it would be stable for a while and the issue would return. Rinse & repeat. Rebooting the controller did expedite recovery (though didn't prevent recurrence).

We don't have a bug #, but did manage to capture a perfstat during one of the outages. We'll keep pushing on this...

Ray


Parisi, Justin

1:24 a.m.

If you are rebooting the controller, you might as well core the box. That may help in analysis of the issue.

Keep in mind that if you're hammering disks in a system with something external (like NDMP) you can affect other protocols, such as CIFS and NFS. The system has limited resources available to it, and pegging out disks, CPU, RAM, etc. can impact everyone. Perfstat would be able to verify if you're pegging resources. If it's not a resource issue with hardware and is a software bug, a core file would help verify that.


Ray Van Dolson

1:30 a.m.

That's something we're definitely keeping in mind as we put together our own internal RCA. This particular box *was* quite busy, with the SATA disks in question at times oversaturated. Perhaps our snaprestore issue would not have reared its head absent some of that oversaturation? It certainly could have contributed to creating conditions where snaprestore could cause the side effects we observed.

With that said, it did not appear that snaprestore running was introducing new "load" -- at least from a metrics standpoint. OnCommand graphs didn't show anything different than what I'd quantify as typical load. We couldn't even tell visually where snaprestore kicked in from the graphs... based on this we initially discounted that snaprestore could be causing the problems...
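When the graphs don't show where the trouble starts, one option is to record NFS reachability directly and line the outage windows up against the snaprestore start and kill times. Below is a minimal Python sketch of that idea, not something anyone in this thread ran. It assumes the filer serves NFS over TCP on the standard port 2049, and the function names are made up for illustration. A successful TCP connect only shows the port is accepting connections, not that NFS requests are being serviced promptly.

```python
import socket
import time
from typing import Iterable, List, Optional, Tuple

NFS_PORT = 2049  # standard NFS port; assumes NFS over TCP


def probe(host: str, port: int = NFS_PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp, ok) samples into (first_failure, first_recovery) pairs.

    An outage still open when the samples end is closed at the last timestamp.
    """
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last: Optional[float] = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # outage ends at first good probe
            start = None
        last = ts
    if start is not None and last is not None:
        windows.append((start, last))
    return windows


def monitor(host: str, interval: float = 30.0, count: int = 10) -> List[Tuple[float, bool]]:
    """Probe host every `interval` seconds, `count` times, and return the samples."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), probe(host)))
        time.sleep(interval)
    return samples
```

Something like `outage_windows(monitor("filer01", interval=30, count=2880))` (the host name `filer01` is hypothetical) would cover roughly a day and return the windows to compare against the restore timeline.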

Fletcher, did your issue occur on a potentially oversaturated environment?

Thanks for all the replies.

Ray


Fletcher Cocquyt

19 Sep 2014

3:24 a.m.

Yes, a colleague started a large snaprestore (1 TB on a SATA aggr) and it ended up coinciding with the full backups late on the weekend. The datastore became unavailable via NFS. The third-shift support engineer had me on the line waiting for an hour before I suggested we just reboot. It was another hour before I insisted we just reboot the head, and service was restored on NFS. Then I recovered several VMs.

I never use snaprestore personally; it is very slow. I recommend a 10G-attached rsync host to recover directly from the .snapshot dir: rsync provides throughput and progress stats and can be restarted if interrupted.
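As a rough illustration of the rsync approach, here is a small Python sketch that builds the command line for copying a file out of a snapshot directory on an NFS-mounted volume. The mount point, snapshot name, and paths are hypothetical placeholders, and the helper names are invented for this example. The rsync flags themselves (`-a`, `--partial`, `--progress`) are standard.

```python
import shlex
from typing import List


def rsync_restore_cmd(mount: str, snapshot: str, rel_path: str, dest: str) -> List[str]:
    """Build an rsync command that copies rel_path out of a snapshot directory.

    mount    -- where the volume is NFS-mounted on the rsync host (hypothetical)
    snapshot -- name of the snapshot under .snapshot (hypothetical)
    rel_path -- path of the file or directory relative to the volume root
    dest     -- destination path on the live file system
    """
    src = f"{mount.rstrip('/')}/.snapshot/{snapshot}/{rel_path.lstrip('/')}"
    return [
        "rsync",
        "-a",          # archive mode: preserve permissions, times, ownership
        "--partial",   # keep partially transferred files if interrupted
        "--progress",  # show per-file progress and transfer rate
        src,
        dest,
    ]


def as_shell(cmd: List[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

For example, `as_shell(rsync_restore_cmd("/mnt/vol1", "nightly.0", "vm/disk.vmdk", "/mnt/vol1/vm/"))` gives `rsync -a --partial --progress /mnt/vol1/.snapshot/nightly.0/vm/disk.vmdk /mnt/vol1/vm/`. Re-running the same command after an interruption skips files whose size and modification time already match.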

This is likely a snaprestore/NFS-related bug in ONTAP. Please let me know if you get any RCA from your perfstats!

Cheers, Fletcher.


John Stoffel

18 Sep 2014

1:31 p.m.

I've also run into performance problems on 8.1.2 (no patch) in 7-Mode, but when running an FPolicy connected to a CommVault Archiver module so that stub files get re-hydrated properly. If we hit it too hard with too many requests, NFS performance goes into the crapper. When you don't remember you have FPolicy enabled (because it usually just works) and there's nothing in the logs or in sysstat showing why NFS is pausing or just stopping, a reboot has worked.

But when it happens a second time in a weekend, you start thinking and remembering more, and turning off FPolicy brings performance back in an instant. I suspect we're both running into the same type of issue, where NFS gets pushed down the priority stack, or it just has too small a buffer to handle outstanding NFSv3 requests, or something.

John
