Hi people
Just out of idle curiosity, am I the only netapp admin who does not know about the super secret flags to allow snapmirror to actually work at reasonable speed?
We were running 8.3.2 cluster mode, and spent weeks looking into why our snapmirrors to our remote site ran so slowly. We were often 2 days behind over 40G networks. Obviously, we focussed on network issues, and we wasted a lot of time. We could make no sense of the problem at all, since sometimes it appeared to work OK, then later the transfers slowed to a crawl.
We eventually opened a case, and it did not take too long for a reply which basically said "why don't you just disable the global snapmirror throttle." I had already looked for such a beast, but found nothing.
As you may or may not know, it turns out to be a per-node setting. The name of the flag is repl_throttle_enable. Of course, you can only see such flags or change them on the node, in privileged mode.
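For anyone hunting for it, the flag lives in the nodeshell at diag privilege. A sketch of how you would inspect and clear it, written from memory and assuming the usual diag-level printflag/setflag syntax; verify against your ONTAP version and, ideally, only do this under support guidance:

```
::> set -privilege diagnostic
::*> node run -node <nodename>
node> priv set diag
node*> printflag repl_throttle_enable
node*> setflag repl_throttle_enable 0
```

Since it is per node, it would need to be set on every node performing transfers, and as noted above it did not survive an upgrade to 9.1.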
Setting the flag to 0 immediately (and I do mean immediately) allowed our snapmirrors to run at the speed you might expect over 40G. Instead of taking 2 days, snapmirror updates now took 2 hours.
We have since upgraded to 9.1. The flags reverted to on, but again can be set to off. I think there is a documented global snapmirror throttle option in 9.1, but I have not looked into that yet.
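For the record, the documented knob in 9.x appears to be the cluster-wide replication throttle options (again from memory, at advanced privilege; check the 9.1 documentation before relying on these names):

```
::> set -privilege advanced
::*> options replication.throttle.enable on
::*> options replication.throttle.outgoing.max_kbs 500000
::*> options replication.throttle.incoming.max_kbs 500000
```

Note that this is the opposite use case: it caps replication bandwidth rather than un-capping it, so it is not obviously a replacement for repl_throttle_enable.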
Are we the only site in the world to have seen this issue? We use snapmirror DR for all our mirrors which may be a factor.
As I said, just idle curiosity, and maybe this will help someone avoid the time-wasting we had.
Regards, pdg
Peter Gray
Information Management & Technology Services
University of Wollongong, Wollongong NSW 2522, Australia
Ph (direct): +61 2 4221 3770 | Ph (switch): +61 2 4221 3555 | Fax: +61 2 4229 1958
Email: pdg@uow.edu.au | URL: http://pdg.uow.edu.au
I scanned the documentation on this flag, and it's not a universally applicable setting. It should only be set in conjunction with a support case to address an identified issue. In general, it should only be set as a temporary measure, but there are exceptions to that general rule.
On the whole, that issue appears to be related to transfer latency. That could be the latency of a slow network, or latency resulting from a network with a problem, such as packet loss. I'd imagine it could also be caused by latency imposed by an overloaded destination SATA aggregate, and it's not out of the question that something newer like 40Gb Ethernet might create some kind of odd issue that warrants setting this flag.
In normal practice, you shouldn't need to touch this parameter. I've been around a long time, and I'd never heard of it before now, and I've never used it with any of my lab setups, and I rely on SnapMirror heavily.
The important thing is not to use this option unless directed by the support center. There's a risk of masking the underlying problem, or creating new problems.
You might consider continuing to follow up on the case to ensure that either (a) you're in an odd situation where this parameter really is warranted or (b) there is some kind of underlying problem that needs fixing. If you're otherwise happy with the way the system is performing and the parameter change worked, I'd probably call it good...
-----Original Message-----
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Peter D. Gray
Sent: Monday, January 30, 2017 12:30 AM
To: toasters@teaparty.net
Subject: super secret flags
We routinely have to turn off the throttle to get reasonable throughput on SnapMirrors and vol moves. I know of other customers that leave it off permanently. We try to leave it at the default.
My understanding is that with throttling off, you can potentially impact performance of client IO. However, we've never had any complaints.
Doug Clendening
-----Original Message-----
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Steiner, Jeffrey
Sent: Monday, January 30, 2017 12:13 AM
To: NGC-pdg-uow.edu.au; toasters@teaparty.net
Subject: [**EXTERNAL**] RE: super secret flags
Thanks for the note. I'm working on a number of DR-related projects that rely on SnapMirror. I'll be looking into this more closely. If I find out anything useful, I'll relay it. If this is something that everyone will need to adjust, so be it, but it still seems odd to me that it would be required outside fringe cases.
-----Original Message-----
From: Clendening, William D [mailto:Doug.Clendening@chevron.com]
Sent: Monday, January 30, 2017 1:57 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>; NGC-pdg-uow.edu.au <pdg@uow.edu.au>; toasters@teaparty.net
Subject: RE: super secret flags
On Mon, Jan 30, 2017 at 06:13:22AM +0000, Steiner, Jeffrey wrote:
> I scanned the documentation on this flag, and it's not a universally applicable setting. It should only be set in conjunction with a support case to address an identified issue. In general, it should only be set as a temporary measure, but there are exceptions to that general rule.
I am not entirely convinced that every customer should need to raise a support case to get their snapmirrors working properly.
> On the whole, that issue appears to be related to transfer latency. That could be the latency of a slow network, or latency resulting from a network with a problem, such as packet loss. I'd imagine it could also be caused by latency imposed by an overloaded destination SATA aggregate, and it's not out of the question that something newer like 40Gb Ethernet might create some kind of odd issue that warrants setting this flag.
Hmmm.... we have a pretty good network, and it's hard to believe our disk latency at 1 AM is a problem. As I said, we got a factor of 10 in snapmirror performance, and no noticeable drop in filer performance at either end.
But as I said elsewhere, it should be my choice how I prioritize performance over data protection. Give me the tools and the documentation.
> In normal practice, you shouldn't need to touch this parameter. I've been around a long time, and I'd never heard of it before now, and I've never used it with any of my lab setups, and I rely on SnapMirror heavily.
Did not work here.
> The important thing is not to use this option unless directed by the support center. There's a risk of masking the underlying problem, or creating new problems.
Hmmmm...... you could be right. But on the other hand, we spent 3 weeks of our time looking at this problem, only to be told about a really simple fix that seems to work a treat.
You can see that does not make us happy.
> You might consider continuing to follow up on the case to ensure that either (a) you're in an odd situation where this parameter really is warranted or (b) there is some kind of underlying problem that needs fixing. If you're otherwise happy with the way the system is performing and the parameter change worked, I'd probably call it good...
Not after 3 weeks of my time and other people's time spent chasing a non-existent network problem. The thing that made me most angry is that there is a completely undocumented setting with an absolutely massive impact on the performance of a major feature in ONTAP.
Basically, I posted this to see if any other people have seen the problem. It appears at least some have.
Regards, pdg
Thanks for all the feedback, this definitely appears to be a gap. This parameter wasn't intended to be required outside edge cases, but it seems that "edge cases" is way too narrow.
I have a question - what is your use of post-processing compression or deduplication?
There seem to be a few other cases where a lot of post-processing work was creating contention with snapmirror operations. Without going into too much detail, both run as lower-priority tasks to ensure they don't interfere with "real" work like host IO operations.
If that's really the context then we need to update the KB article so nobody else ends up chasing a network or disk latency problem that doesn't exist. I'd imagine there could be other lower-priority tasks that could disproportionately mess with snapmirror transfer rates too.
-----Original Message-----
From: Peter D. Gray [mailto:pdg@uow.edu.au]
Sent: Monday, January 30, 2017 11:52 PM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>
Cc: NGC-pdg-uow.edu.au <pdg@uow.edu.au>; toasters@teaparty.net
Subject: Re: super secret flags
Hi Jeffrey,
Not sure whether you were addressing Peter specifically or the list in general, but I'll answer nonetheless :)
I have 23 entries in 'volume efficiency show -state enabled', and of those, 14 use an inline-only policy on SSD. None of those are being SnapMirrored. Of the remaining 9, 4 are being SnapMirrored to our DR site. Of those 4, 1 is the constantly lagged 20 TB volume previously mentioned. The other constantly lagged volumes are not being deduped or compressed.
My rough guesstimate is that after disabling this global throttle and leaving the SnapMirrors running overnight, we transferred more snapshot data from these lagged volumes in under 24 hours than in the entire previous month, maybe more. The tradeoff I am seeing is increased node CPU utilization, as well as an occasional small uptick in latency to NFS clients.
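For anyone wanting to keep an eye on lag across many relationships, here is a small sketch that converts lag-time strings into seconds and flags the laggards. This is a hypothetical helper, assuming lag values in [dd:]hh:mm:ss form as dumped from something like `snapmirror show -fields lag-time`; the exact field format may differ by ONTAP version.

```python
# Hypothetical helper for spotting lagged SnapMirror relationships.
# Assumes lag-time strings look like "dd:hh:mm:ss" or "hh:mm:ss".

def lag_seconds(lag):
    """Convert an ONTAP-style lag-time string into total seconds."""
    parts = [int(p) for p in lag.split(":")]
    while len(parts) < 4:        # pad missing leading fields (days, hours)
        parts.insert(0, 0)
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def lagging(relationships, threshold_hours=24.0):
    """Return relationship names whose lag exceeds the threshold."""
    limit = threshold_hours * 3600
    return [name for name, lag in sorted(relationships.items())
            if lag_seconds(lag) > limit]

if __name__ == "__main__":
    rels = {"svm1:vol_a": "0:01:15:00",   # ~1.25 hours behind
            "svm1:vol_b": "2:06:00:00"}   # ~54 hours behind
    print(lagging(rels))  # ['svm1:vol_b']
```

Feeding this from a scheduled `snapmirror show` dump would give a cheap early warning before a relationship falls days behind.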
I wonder whether, as filer hardware gets more powerful and enough evidence comes to light, this global throttle could be disabled by default, or at least made less aggressive. The snapmirror throughput increase is so significant that if I hadn't seen it myself, I'd assume someone was misreading the numbers.
Ian Ehrenwald
Senior Infrastructure Engineer
Hachette Book Group, Inc.
1.617.263.1948 / ian.ehrenwald@hbgusa.com
________________________________________
From: toasters-bounces@teaparty.net on behalf of Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>
Sent: Tuesday, January 31, 2017 2:30:35 AM
To: NGC-pdg-uow.edu.au
Cc: toasters@teaparty.net
Subject: RE: super secret flags
Hi Jeffrey,
Just adding another voice to the "we've experienced abysmal snapmirror performance in cmode" crowd. We've never really had a satisfactory answer as to why from our third-party support people or NetApp, and have spent a tremendous amount of time trying to track down the cause of snapmirror issues (including buying larger controllers). This is the first we've heard of this throttle setting, and we'll certainly test it over a weekend to see if it helps, since we still see lagging mirrors and can't work out why.
We have a large number of post-process deduped volumes, no compression, to answer your question.
Regards,
Tim
Thanks for all the replies, I'm going to bring this up with engineering and ask for clearer guidance. I suspect we really need an updated KB at a minimum. If this particular problem arises because there's legitimately a lot of "extra" work going on, then this is just another tunable that needs to be documented. On the other hand, if something like post-processing dedupe is abnormally outcompeting SnapMirror, that's a bug that ought to be fixed.
It might take a week, but I'll report back on what I find.
-----Original Message-----
From: Tim Parkinson [mailto:t.r.parkinson@sheffield.ac.uk]
Sent: Wednesday, February 01, 2017 6:37 AM
To: Steiner, Jeffrey <Jeffrey.Steiner@netapp.com>
Cc: toasters@teaparty.net
Subject: Re: super secret flags
Hi Jeffrey,
Just adding another voice to the "We've experienced abysmal snapmirror performance in cmode" crowd. We've never really had a satisfactory answer to why from our third party support people/netapp and have spent a tremendous amount of time trying to track down the cause of snapmirror issues (including buying larger controllers). This is the first we've heard of this throttle setting, and will certainly test it over a weekend to see if it helps us out, since we still see lagging mirrors and can't work out why.
We have a large number of post-process deduped volumes, no compression, to answer your question.
Regards,
Tim
On 31 January 2017 at 07:30, Steiner, Jeffrey Jeffrey.Steiner@netapp.com wrote:
Thanks for all the feedback, this definitely appears to be a gap. This parameter wasn't intended to be required outside edge cases, but it seems that "edge cases" is way too narrow.
I have a question - what is your use of post-processing compression or deduplication?
There seems to be a few other cases where a lot of post-processing work was creating contention with snapmirror operations. Without going into too much detail, they both run as lower-priority tasks to ensure they don't interfere with "real" work like host IO operations.
If that's really the context then we need to update the KB article so nobody else ends up chasing a network or disk latency problem that doesn't exist. I'd imagine there could be other lower-priority tasks that could disproportionately mess with snapmirror transfer rates too.
-----Original Message----- From: Peter D. Gray [mailto:pdg@uow.edu.au] Sent: Monday, January 30, 2017 11:52 PM To: Steiner, Jeffrey Jeffrey.Steiner@netapp.com Cc: NGC-pdg-uow.edu.au pdg@uow.edu.au; toasters@teaparty.net Subject: Re: super secret flags
On Mon, Jan 30, 2017 at 06:13:22AM +0000, Steiner, Jeffrey wrote:
I scanned the documentation on this flag, and it's not a universally applicable setting. It should only be set in conjunction with a support case to address an identified issue. In general, it should only be set as a temporary measure, but there are exceptions to that general rule.
I am not entirely convinced that every customer should need to raise a support case to get their snapmirrors working properly.
On the whole, that issue appears to be related to transfer latency. That could be the latency of a slow network or the latency resulting from a network with a problem, such as packet loss. I'd imagine it could be also caused by latency imposed by an overloaded destination SATA aggregate as well, plus it's not out of the question that something newer like 40Gb Ethernet might create some kind of odd issue that warrants setting this flag.
Hmmm.... we have a pretty good network. And its hard to believe our disk latency at 1AM is a problem. As I said, we got a factor of 10 in terms of snapmirror performance, and no noticeable drop in filer performance at either end.
But as I said elsewhere, it should be my choice how I prioritize performance over data protection. Give me the tools and the documentation.
In normal practice, you shouldn't need to touch this parameter. I've been around a long time and I'd never heard of it before now; I've never used it with any of my lab setups, and I rely on SnapMirror heavily.
Did not work here.
The important thing is not to use this option unless directed by the support center. There's a risk of masking the underlying problem, or creating new problems.
Hmmmm...... you could be right. But on the other hand we spent 3 weeks of our time looking at this problem only to be told about a really simple fix that seems to work a treat.
You can see that does not make us happy.
You might consider continuing to follow up on the case to ensure that either (a) you're in an odd situation where this parameter really is warranted or (b) there is some kind of underlying problem that needs fixing. If you're otherwise happy with the way the system is performing and the parameter change worked, I'd probably call it good...
Not after 3 weeks of my time and other people's time spent chasing a non-existent network problem. The thing that made me the most angry is that there is a completely undocumented setting that has an absolutely massive impact on the performance of a major feature in ONTAP.
Basically, I posted this to see if any other people have seen the problem. It appears at least some have.
Regards, pdg
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
-- Tim Parkinson Server & Storage Administrator University of Sheffield 0114 222 3039
If anyone on this distribution list runs into unexplained slow snapmirror transfers, please open a support case and cite BURT 1030457. It sounds like, under some circumstances we don't fully understand, the throttle is too aggressive. Post-processing deduplication jobs seem to be connected, but there's probably more to it than just that.
I've tagged the BURT with the support cases mentioned so far in this thread, and requested a better KB article explaining when this flag might need to be updated.
-----Original Message----- From: Tim Parkinson [mailto:t.r.parkinson@sheffield.ac.uk] Sent: Wednesday, February 01, 2017 6:37 AM To: Steiner, Jeffrey Jeffrey.Steiner@netapp.com Cc: toasters@teaparty.net Subject: Re: super secret flags
Hi Jeffrey,
Just adding another voice to the "we've experienced abysmal snapmirror performance in cmode" crowd. We've never really had a satisfactory answer as to why from our third-party support people/NetApp, and we have spent a tremendous amount of time trying to track down the cause of snapmirror issues (including buying larger controllers). This is the first we've heard of this throttle setting, and we will certainly test it over a weekend to see if it helps us out, since we still see lagging mirrors and can't work out why.
We have a large number of post-process deduped volumes, no compression, to answer your question.
Regards,
Tim
On 31 January 2017 at 07:30, Steiner, Jeffrey Jeffrey.Steiner@netapp.com wrote:
Thanks for all the feedback; this definitely appears to be a gap. This parameter wasn't intended to be required outside edge cases, but it seems that "edge cases" is way too narrow.
I have a question - what is your use of post-processing compression or deduplication?
Hi Jeffrey and others,
I don't want to hijack this thread, since this is specifically about the repl_throttle_enable flag, but are you guys aware of the performance impact on SnapMirror when the transfers run over etherchannels with port-based hashing on the sender side?
I have come across this a couple of times (the first time I encountered it, I logged a case: 2005111796). Unfortunately I have never had the time to troubleshoot it. In case 2005111796, support observed packet loss in the setup with port-based hashing, but we had to destroy our (test/troubleshooting) setup before we could get to the bottom of it. Since then, I have come across this on several occasions. More often than not it was not a real issue, since those SnapMirrors ran across WAN links, or SnapMirror runs at night and can take all the time it wants, but on 1Gbps/10Gbps LANs where SM updates need to be fast, it is an issue. However, I found a TR that mentions that SnapMirror performance can be impacted by port-based ifgrps, so I've never bothered to open any additional cases for this.
Can anyone else confirm this behavior?
(To put in my two cents on the repl_throttle_enable flag: a customer today reported this SnapMirror progress with/without throttling: 300GB in 2 hours vs. 100GB in 15 minutes after we disabled this flag. Also, earlier this week I had to wait for 160 TB of vol move operations on a 2nd-line system. When disabling the repl_throttle_enable flag, I saw little or no impact for volumes with "dead"/unmodified data on them, but a big impact for (NFS) VMware datastores with some live VMs sitting on them: the cutover estimation from "vol move show" was reduced by 24 hours almost immediately. I am quite sure those VMs will have been impacted, as CPU and disk load was pegged at 90+%.)
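(Doing the arithmetic on that first report, with decimal GB assumed since the exact units weren't given, the difference works out to roughly 42 MB/s throttled versus 111 MB/s unthrottled:)

```python
# Quick sanity check on the numbers above (decimal GB assumed).
throttled = 300 * 1000 / (2 * 3600)       # MB/s for 300GB in 2 hours
unthrottled = 100 * 1000 / (15 * 60)      # MB/s for 100GB in 15 minutes

print(round(throttled, 1))                # ~41.7 MB/s
print(round(unthrottled, 1))              # ~111.1 MB/s
print(round(unthrottled / throttled, 1))  # ~2.7x faster
```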
Best regards, Filip
On Thu, Feb 02, 2017 at 03:41:19PM +0100, Filip Sneppe wrote:
Hi Jeffrey and others,
for (NFS) VMware datastores with some live VMs sitting on them: the cutover estimation from "vol move show" was reduced by 24 hours almost immediately - I am quite sure those VMs will have been impacted as CPU and disk load was pegged at 90+%).
You know about the -bypass-throttling true flag on the volume move command, right? Again, a super secret option only available in diag mode. Without this, we find volume move takes forever.
This is our canned volume move command
set -privilege diag -confirmations off ';' \
    volume move start -vserver "$svm" -volume "$volume" \
    -destination-aggregate "$new_aggregate" -bypass-throttling true
Volume split also takes about a thousand years and I have not found a way to speed that up.
However, you can cheat by moving the new volume to another aggregate (as long as you disable the throttle). That effectively splits the volume, about 100 times quicker than waiting for the split to finish.
There seems to be a widespread issue with the builtin ONTAP throttles.
Regards, pdg
On Thu, Feb 02, 2017 at 03:41:19PM +0100, Filip Sneppe wrote:
Hi Jeffrey and others,
Can anyone else confirm this behavior?
Is this an LACP ifgrp? We have had no issues with LACP on cluster mode. On 7-mode, we saw many missed LACP packets but, like you, we never investigated fully because it kept working. One thing our network guys drum into us is that the LACP setting MUST agree at both ends.
Regards, pdg
We don't think the issue was caused by LACP. The difference between a configuration that replicates 'fast' or 'slow' was the distr-func set to IP (fast) or port (slow). In both situations we did use "multimode_lacp" as mode.
Regards,
Wouter Vervloesem (https://be.linkedin.com/pub/wouter-vervloesem/5/a63/a41) Storage Consultant
Neoria NV (http://www.neoria.be/) Prins Boudewijnlaan 41 - 2650 Edegem T +32 3 451 23 82 | M +32 496 52 93 61
There is no requirement for the LACP hashing configuration to be the same on both sides. On the whole, it doesn't make any difference if there's a mismatch.
The important thing that a lot of people miss is that LACP distribution policies are controlled by the sending device. There is no negotiation. For example, you can have ONTAP using IP hashing, while the switch is using src-dst-MAC hashing. That might be a bad idea, such as with a routed environment where only 2 MAC addresses are talking, but it doesn't create a compatibility problem.
I've seen a few older switches that really don't like port hashing. I'm not sure exactly what's happening, but it seemed like the architecture of the switch wasn't expecting the same IP/MAC to appear on multiple ports. It would pass traffic, but CPU utilization jumped significantly when any kind of port hashing was in use. Changing to IP solved the problem.
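The point that the sender alone picks the egress link is easy to picture with a toy model. This is only an illustration - a stand-in CRC hash, made-up addresses and a stand-in replication port, not any vendor's real algorithm - but it shows why an "ip" policy puts every connection between two nodes on one link, while a "port" policy scatters the same IP/MAC pair across several switch ports:

```python
import zlib

NUM_LINKS = 4

def pick_link(key, num_links=NUM_LINKS):
    # Stand-in hash; real devices use vendor-specific functions.
    return zlib.crc32(repr(key).encode()) % num_links

# 16 SnapMirror-style TCP connections between the same two endpoints.
flows = [("10.0.0.1", "10.0.0.2", 50000 + i, 10566) for i in range(16)]

# "ip" policy keys on the address pair only: one link for everything.
ip_links = {pick_link((src, dst)) for (src, dst, sport, dport) in flows}

# "port" policy keys on the whole 4-tuple: connections spread out, so the
# same source IP/MAC shows up on several switch ports.
port_links = {pick_link((src, dst, sport, dport))
              for (src, dst, sport, dport) in flows}

print("links used with ip hashing:  ", sorted(ip_links))    # always exactly 1
print("links used with port hashing:", sorted(port_links))  # usually several
```

That scattering is exactly the pattern some older switch architectures apparently handle badly.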
Hi Jeffrey,
In our case(s), the determining factor was that the distr-func was set to "port" on the sender/source side of the SnapMirror relationship. At the receiving end, this setting didn't matter. Yes, we are aware that the hashing algorithm does not need to be matched between both sides (including the hashing algorithm on the switch).
Also, I suspect it's not so much an LACP issue and we would probably have run into the same issue with a static multimode etherchannel too, although we've never tested this.
Before we had to break up our testing environment, we had tested and confirmed this behavior on Cisco 3750 and Nexus switches. Those aren't very exotic, so we did worry about the performance drop.
PS: great thread, by the way. Thanks Peter D. Gray for that other hidden flag in your reply :-)
Best regards, Filip
I almost mentioned that the Cisco 3750 switch was a model where I'd seen problems with port hashing. The 3750 has been around for a while, so it might have improved over time. I've also seen odd problems related to those four SFP ports on the right hand side of the switch.
A Nexus shouldn't have issues unless vPCs are in use. There are some odd special settings to make vPCs play nice with LACP with some vendors, including NetApp, and there have been a few vPC-related bugs as well.
Would you be able to share those odd special settings to make vPCs play nice with LACP and NetApp filers? Our datacenter sites have 2 x Nexus 5672s with vPCs between them, and the filers directly connected, so I want to run this by our network team to see if we're missing anything. AFAIK we have not seen any issues that would be a result of LACP problems, but I'm a curious cat.
Ian Ehrenwald Senior Infrastructure Engineer Hachette Book Group, Inc. 1.617.263.1948 / ian.ehrenwald@hbgusa.com
Do a search on the support site for "peer-gateway" and you'll see a number of articles, but the summary is just this:
Make sure peer-gateway is set in the vPC domain configuration, or the routing might not be what you think it is. You'll have traffic moving asymmetrically, and it can lead to weird ISL saturation situations.
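For reference, a sketch of where peer-gateway lives in an NX-OS vPC configuration (the domain number and keepalive addresses are made-up examples; check Cisco's configuration guide for your platform):

```
vpc domain 10
  peer-gateway
  peer-keepalive destination 192.0.2.2 source 192.0.2.1
  auto-recovery
```

With peer-gateway set, either vPC peer will forward frames addressed to its peer's gateway MAC locally instead of sending them across the peer link, which is what prevents the asymmetric-traffic and ISL saturation behaviour described above.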
I've been involved in a number of intermittent performance problems that ultimately resulted from this missing setting.
-----Original Message----- From: Ehrenwald, Ian [mailto:Ian.Ehrenwald@hbgusa.com] Sent: Friday, February 03, 2017 2:24 PM To: Steiner, Jeffrey Jeffrey.Steiner@netapp.com; Filip Sneppe filip.sneppe@gmail.com Cc: toasters@teaparty.net Subject: Re: super secret flags
It's funny you mention this - I had a support case open not more than a couple weeks ago regarding the same exact thing.
I have a few fairly large volumes (20t, 40t) that are consistently lagged in their SM replication by over a week to our DR site. The primary and DR sites are connected at 5Gb/s, and we've been able to fill that pipe in the past. The aggregate that holds these two volumes is made of 4 x DS2246 on both the source and destination side. The destination aggregate is mostly idle; the source aggregate sees 20K+ IOPS 24/7/365. I also have an aggregate made of 8 x DS2246 that is pretty busy all the time, and volumes on that aggregate replicate to an identical aggregate at the DR site and are never lagged.
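As a back-of-envelope illustration of why a throttle matters at this scale (my own numbers: a hypothetical throttled rate of 150 Mbit/s, ignoring compression, change rate, and protocol overhead):

```python
def transfer_time_hours(size_tb: float, rate_mbps: float) -> float:
    """Hours to move size_tb (decimal terabytes) at rate_mbps (megabits per second)."""
    bits = size_tb * 1e12 * 8
    return bits / (rate_mbps * 1e6) / 3600

# a full 20 TB baseline at a throttled ~150 Mbit/s vs. the full 5 Gb/s link
print(round(transfer_time_hours(20, 150), 1))   # roughly 296 hours, i.e. ~12 days
print(round(transfer_time_hours(20, 5000), 1))  # roughly 8.9 hours
```

When the effective rate is throttled that low, a big transfer takes longer than the update schedule, which is exactly how a mirror ends up lagged by a week.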
The support engineer I was working with did mention that we could disable this global throttle though it may have an impact on client latency, so I didn't do it.
The best idea we could come up with is that the source side aggregate with the lagged SM volumes, and the node that owns it (a FAS8060), might be IOPS and CPU bound. We could consider adding more shelves to this aggregate, running a reallocate to spread the blocks around, and seeing if that helps.
It's not really in my budget at the moment to purchase four more DS2246 with 24 x 1.2t each (2 for primary, 2 for DR), so this has rekindled my interest in trying this global throttle flag on a weekend where, if IO bogs down, nobody will complain (too much) :)
-- Ian Ehrenwald Senior Infrastructure Engineer Hachette Book Group, Inc. 1.617.263.1948 / ian.ehrenwald@hbgusa.com
On 1/29/17, 6:29 PM, "Peter D. Gray" pdg@uow.edu.au wrote:
OK so curiosity got the better of me :) I just disabled this internal throttle and the lagged SnapMirrors went from about 150Mbit/s to 2.7Gbit/s according to our network monitoring tools. CPU utilization on the node that owns the disks has definitely increased, sometimes to the tune of 93% or higher, and latency across all volumes has ticked up by a small but measurable amount. Disk utilization % as measured by sysstat is still within a reasonable range. I do see a lot more CP activity, mostly :s :n and :f.
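For the record, toggling the flag looks roughly like this — a sketch only (node names are examples, setflag/printflag live in the diag-privilege nodeshell, and as the list has already noted this should really be done under support guidance):

```
cluster1::> set -privilege diag
cluster1::*> node run -node cluster1-01
cluster1-01> priv set diag
cluster1-01*> printflag repl_throttle_enable
cluster1-01*> setflag repl_throttle_enable 0
```

The flag is per node, so it has to be set on every node hosting snapmirror sources, and (per Peter's 9.1 experience) it can revert after an upgrade.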
I understand why this throttle is enabled by default, and I would not keep it disabled long-term because of the HA CPU concerns, but the increase in SM throughput is unbelievable.
-- Ian Ehrenwald Senior Infrastructure Engineer Hachette Book Group, Inc. 1.617.263.1948 / ian.ehrenwald@hbgusa.com
On 1/30/17, 9:56 AM, "Ehrenwald, Ian" Ian.Ehrenwald@hbgusa.com wrote:
On Mon, Jan 30, 2017 at 02:56:48PM +0000, Ehrenwald, Ian wrote:
We saw absolutely no impact on performance on the source filers, though I am sure it's possible it could impact performance in other environments.
In fact, I would argue our performance may have improved, because the snapmirrors now finish in just a few hours overnight when we are under light load, rather than 20 snapmirrors running during the day for days on end.
Also, I already have a mechanism to control throttling: each snapmirror has its own throttle setting. What's the point of a per-mirror limit if the global limit makes it useless?
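(The per-relationship throttle Peter refers to looks something like this in clustered ONTAP — the destination path and value are made-up examples, and -throttle is expressed in KB/s:)

```
cluster1::> snapmirror modify -destination-path dr_svm:vol1_dst -throttle 51200
```

That caps that one relationship at roughly 50 MB/s; setting the throttle to unlimited removes the cap. The hidden node-level flag sits underneath this and overrides it, which is Peter's complaint.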
Also, data protection can be more important than performance. That should be up to us to decide, not NetApp.
The bottom line is that a global snapmirror rate throttle is a great idea. It should be documented, easily controllable by the administrator, and have a reasonable default value. The current throttle appears to have none of these properties.
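For anyone on 9.1 or later, the documented cluster-wide throttle mentioned earlier in the thread appears to be driven by cluster options along these lines (values are examples and the syntax is from memory, so verify against the ONTAP release notes and docs before relying on it):

```
cluster1::> options replication.throttle.enable on
cluster1::> options replication.throttle.outgoing.max_kbs 500000
cluster1::> options replication.throttle.incoming.max_kbs 500000
```

which would at least meet the "documented and administrator-controllable" bar, unlike the hidden node flag.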
Regards, pdg