Hi Jeffrey and others,
I don't want to hijack this thread, since it is specifically about the repl_throttle_enable flag, but are you guys aware of the performance impact on SnapMirror when transfers run over etherchannels with port-based hashing on the sender side?
I have come across this a couple of times (the first time I encountered it, I logged a case for it: 2005111796). Unfortunately I have never had the time to troubleshoot it properly. In case 2005111796, support observed packet loss in the setup with port-based hashing, but we had to destroy our (test/troubleshooting) setup before we could get to the bottom of it. Since then I have come across this on several occasions. More often than not it was not a real issue, since those SnapMirrors ran across WAN links, or SnapMirror runs at night and can take all the time it wants; but on 1Gbps/10Gbps LANs where SnapMirror updates need to be fast, it is an issue. However, I found out there is a TR that mentions SnapMirror performance can be impacted by port-based ifgrps, so I've never bothered to open any additional cases for this.
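For anyone who wants to check their own setup: on clustered ONTAP the hashing method of an ifgrp shows up as the distribution function, so something along these lines (just a sketch, exact field names may differ per release) will tell you what the sender side is actually doing:

    cluster::> network port ifgrp show -fields distr-func

As far as I know you cannot change the distribution function on an existing ifgrp, so if it says "port" you would have to recreate the ifgrp with ip- or mac-based hashing to test the difference.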
Can anyone else confirm this behavior?
(To put in my two cents on the repl_throttle_enable flag: a customer today reported this SnapMirror progress with and without throttling: 300 GB in 2 hours versus 100 GB in 15 minutes after we disabled the flag. Also, earlier this week I had to wait for 160 TB of vol move operations on a second-line system. After disabling the repl_throttle_enable flag, I saw little or no impact for volumes with "dead"/unmodified data on them, but a big impact for (NFS) VMware datastores with live VMs sitting on them: the cutover estimation from "vol move show" dropped by 24 hours almost immediately. I am quite sure those VMs will have been impacted, as CPU and disk load was pegged at 90+%.)
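(And for reference, since I mentioned disabling the flag: it is a node-level diag flag, so this is roughly how we toggled it - a sketch only, assuming printflag/setflag on the nodeshell behave the same on your release, and only with a support case open as Jeffrey asks below:

    cluster::> node run -node <nodename>
    nodename> priv set diag
    nodename*> printflag repl_throttle_enable
    nodename*> setflag repl_throttle_enable 0

As far as I know a flag set this way does not survive a reboot, so treat it as a temporary measure in any case.)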
Best regards, Filip
On Thu, Feb 2, 2017 at 9:13 AM, Steiner, Jeffrey <Jeffrey.Steiner@netapp.com> wrote:
If anyone on this distribution list runs into unexplained slow SnapMirror transfers, please open a support case and cite BURT 1030457. It sounds like, under some circumstances we don't fully understand, the throttle is too aggressive. Post-processing deduplication jobs seem to be connected, but there's probably more to it than just that.
I've tagged the BURT with the support cases mentioned so far in this thread, and requested a better KB article explaining when this flag might need to be updated.
-----Original Message-----
From: Tim Parkinson [mailto:t.r.parkinson@sheffield.ac.uk]
Sent: Wednesday, February 01, 2017 6:37 AM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>
Cc: toasters@teaparty.net
Subject: Re: super secret flags
Hi Jeffrey,
Just adding another voice to the "we've experienced abysmal snapmirror performance in cmode" crowd. We've never really had a satisfactory answer as to why from our third-party support people or NetApp, and have spent a tremendous amount of time trying to track down the cause of SnapMirror issues (including buying larger controllers). This is the first we've heard of this throttle setting, and we will certainly test it over a weekend to see if it helps us out, since we still see lagging mirrors and can't work out why.
We have a large number of post-process deduped volumes, no compression, to answer your question.
Regards,
Tim
On 31 January 2017 at 07:30, Steiner, Jeffrey <Jeffrey.Steiner@netapp.com> wrote:
Thanks for all the feedback; this definitely appears to be a gap. This parameter wasn't intended to be required outside edge cases, but it seems that "edge cases" is way too narrow.
I have a question: what is your use of post-processing compression or deduplication?
There seem to be a few other cases where a lot of post-processing work was creating contention with SnapMirror operations. Without going into too much detail, they both run as lower-priority tasks to ensure they don't interfere with "real" work like host IO operations.
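For anyone trying to correlate this on their own systems, something like the following (a sketch, field names may vary between releases) should show which volumes have post-process efficiency work configured:

    cluster::> volume efficiency show -fields state, policy, schedule

Volumes with a policy or schedule attached are the ones whose background scans would be running in that same lower-priority class alongside SnapMirror.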
If that's really the context, then we need to update the KB article so nobody else ends up chasing a network or disk latency problem that doesn't exist. I'd imagine there could be other lower-priority tasks that could disproportionately mess with SnapMirror transfer rates too.
-----Original Message-----
From: Peter D. Gray [mailto:pdg@uow.edu.au]
Sent: Monday, January 30, 2017 11:52 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>
Cc: NGC-pdg-uow.edu.au <pdg@uow.edu.au>; toasters@teaparty.net
Subject: Re: super secret flags
On Mon, Jan 30, 2017 at 06:13:22AM +0000, Steiner, Jeffrey wrote:
I scanned the documentation on this flag, and it's not a universally applicable setting. It should only be set in conjunction with a support case to address an identified issue. In general, it should only be set as a temporary measure, but there are exceptions to that general rule.
I am not entirely convinced that every customer should need to raise a support case to get their snapmirrors working properly.
On the whole, that issue appears to be related to transfer latency. That could be the latency of a slow network or the latency resulting from a network with a problem, such as packet loss. I'd imagine it could also be caused by latency imposed by an overloaded destination SATA aggregate, and it's not out of the question that something newer like 40Gb Ethernet might create some kind of odd issue that warrants setting this flag.
Hmmm.... we have a pretty good network, and it's hard to believe our disk latency at 1 AM is a problem. As I said, we got a factor of 10 improvement in snapmirror performance, and no noticeable drop in filer performance at either end.
But as I said elsewhere, it should be my choice how I prioritize performance over data protection. Give me the tools and the documentation.
In normal practice, you shouldn't need to touch this parameter. I've been around a long time and had never heard of it before now; I've never used it with any of my lab setups, and I rely on SnapMirror heavily.
Did not work here.
The important thing is not to use this option unless directed by the support center. There's a risk of masking the underlying problem or creating new problems.
Hmmmm...... you could be right. But on the other hand, we spent 3 weeks of our time looking at this problem, only to be told about a really simple fix that seems to work a treat.
You can see that does not make us happy.
You might consider continuing to follow up on the case to ensure that either (a) you're in an odd situation where this parameter really is warranted, or (b) there is some kind of underlying problem that needs fixing. If you're otherwise happy with the way the system is performing and the parameter change worked, I'd probably call it good...
Not after 3 weeks of my time and other people's time spent chasing a non-existent network problem.
The thing that made me most angry is that there is a completely undocumented setting that has an absolutely massive impact on the performance of a major feature in ONTAP.
Basically, I posted this to see if any other people have seen the problem.
It appears at least some have.
Regards, pdg
--
Tim Parkinson
Server & Storage Administrator
University of Sheffield
0114 222 3039