Hi All,
We extensively use NetApp API calls to monitor 7Mode filers, and took the same approach for cDOT monitoring.
Here is a very unpleasant discovery:
1. Take one node (or more nodes) *offline*, e.g. power it off for maintenance.
2. Try to run *any* API call against the cluster interface and get the following error:
OUTPUT: <results reason="RPC: Port mapper failure - RPC: Timed out" status="failed" errno="13001"></results>
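For context, our checks boil down to raw ZAPI requests over HTTPS against the cluster management LIF. A minimal sketch in Python (hostname, credentials and the ONTAPI version string are placeholders; certificate verification is off purely for brevity):

import requests

# Placeholders -- adjust for your environment.
URL = "https://cluster-mgmt.example.com/servlets/netapp.servlets.admin.XMLrequest_filer"
AUTH = ("monitor", "secret")

# Every ONTAPI/ZAPI call is an XML document wrapped in a <netapp> envelope.
REQUEST = """<?xml version="1.0" encoding="UTF-8"?>
<netapp version="1.31" xmlns="http://www.netapp.com/filer/admin">
  <system-get-version/>
</netapp>"""

resp = requests.post(URL, data=REQUEST, auth=AUTH,
                     headers={"Content-Type": "text/xml"},
                     verify=False, timeout=60)
print(resp.text)  # with a node powered off: status="failed" errno="13001"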
It effectively makes your cluster-wide monitoring useless.
Any ideas? Is it a feature or a bug?
Cheers, Vladimir
Did the cluster interface fail over correctly? Can the host doing the API calls "ssh" into the cluster address?
--tmac
Tim McCarthy, Principal Consultant
Try narrowing your API call to a specific node. It’s possible it’s trying to query the node that’s down and causing the timeout.
The API might not be smart enough to ignore a node that is not up.
Also be sure to check that it did fail over properly as tmac mentioned. And that the cluster is in quorum. (set diag; cluster show; cluster ring show)
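Untested sketch of what I mean, using diagnosis-alert-get-iter as the example; I'm assuming here that diagnosis-alert-info exposes a node field you can match in the iterator's <query>:

# Hypothetical request body: scope diagnosis-alert-get-iter to a single
# healthy node via the iterator's <query> element, rather than letting it
# fan out to the whole cluster.
REQUEST = """<?xml version="1.0" encoding="UTF-8"?>
<netapp version="1.31" xmlns="http://www.netapp.com/filer/admin">
  <diagnosis-alert-get-iter>
    <query>
      <diagnosis-alert-info>
        <node>na101node-1a</node>
      </diagnosis-alert-info>
    </query>
  </diagnosis-alert-get-iter>
</netapp>"""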
A correction to my initial statement:
1. I have the whole HA pair (i.e. two nodes) powered off.
On Wed, Mar 30, 2016 at 3:30 PM, Parisi, Justin <Justin.Parisi@netapp.com> wrote:
Try narrowing your API call to a specific node. It’s possible it’s trying to query the node that’s down and causing the timeout.
I initially noticed this behavior with the "diagnosis-alert-get-iter" call, which doesn't require a node parameter. But even a simple call like "version" fails.
The API might not be smart enough to ignore a node that is not up.
The reality proves otherwise =) I'm on 8.3.1.
Also be sure to check that it did fail over properly as tmac mentioned. And that the cluster is in quorum. (set diag; cluster show; cluster ring show)
Since both nodes are down, there was actually no failover taking place.
Here is what I get:
cdot::*> cluster ring show
.. <output of healthy nodes here> ..

Warning: Unable to list entries on node na101node-4a.
         RPC: Port mapper failure - RPC: Timed out
         Unable to list entries on node na101node-4b.
         RPC: Port mapper failure - RPC: Timed out
30 entries were displayed.
Oh, well that's different entirely. :)
The cluster may be out of quorum, which is causing this issue.
Did you capture the aforementioned commands?
"RPC timeout" here means that the API is being sent across the cluster to other nodes via RPC. Since the nodes are down, the commands are failing.
Keep in mind that having two nodes in a cluster powered off is not a normal scenario. If you are doing maintenance, you would want to mark those nodes as "eligibility false" to ensure they don't participate in the cluster during maintenance. You also want to ensure epsilon is not on those nodes, and to move epsilon if it is.
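Roughly, from memory, something like this at advanced privilege (node names are placeholders for whatever you're taking down; double-check the syntax on your version):

::*> set -privilege advanced
::*> cluster modify -node <node-going-down> -eligibility false
::*> cluster show
(if epsilon sits on one of the nodes going down, move it to a surviving node first)
::*> cluster modify -node <node-going-down> -epsilon false
::*> cluster modify -node <surviving-node> -epsilon true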
On Wed, Mar 30, 2016 at 4:31 PM, Parisi, Justin <Justin.Parisi@netapp.com> wrote:
Oh, well that's different entirely. :)
true =)
The cluster may be out of quorum, which is causing this issue.
I don't think so: my cluster is down to 6 running nodes (minus the 2 powered off), and one of the running nodes holds epsilon, so IMO it's still in quorum.
Did you capture the aforementioned commands?
"RPC timeout" here means that the API is being sent across the cluster to other nodes via RPC. Since the nodes are down, the commands are failing.
Yes, I did; the healthy nodes reply just fine.
Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
na101node-1a         true    true         false
na101node-1b         true    true         false
na101node-2a         true    true         true
na101node-2b         true    true         false
na101node-3a         true    true         false
na101node-3b         true    true         false
na101node-4a         false   true         false
na101node-4b         false   true         false
Keep in mind that having two nodes in a cluster powered off is not a normal scenario. If you are doing maintenance, you would want to mark those nodes as "eligibility false" to ensure they don't participate in the cluster during maintenance. You also want to ensure epsilon is not on those nodes, and to move epsilon if it is.
Well, I just tried to set "eligibility" to false, but it didn't fix the API call issue:
::*> node modify -node na101node-4* -eligibility false
Warning: When a node's eligibility is set to "false," it cannot serve SAN data, and NAS access might also be affected. This setting should be used only for unusual maintenance operations. To restore the node's data-serving capabilities, set the eligibility to "true" and reboot the node.
Continue? {y|n}: y
2 entries were modified.
Well I'd suggest opening a support case and getting a bug filed.
While this scenario is odd, the APIs should be smart enough to ignore nodes that are unhealthy/ineligible.
Yes, I do agree it's worth a ticket ..
I think the following CLI command relies on the same API call "internally":
::*> system health alert show
This table is currently empty.

Warning: Unable to list entries on node na101node-4a.
         RPC: Port mapper failure - RPC: Timed out
         Unable to list entries on node na101node-4b.
         RPC: Port mapper failure - RPC: Timed out
I dug into this a bit more, and it seems that not every single API call fails, just some, e.g.:

system-get-version - works
version - fails
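In case anyone wants to repeat the exercise, here is roughly how I probe the calls (same placeholder host and credentials as in my sketch above; I left "version" out of the probe list since I'm not certain how it maps to an API name):

import requests

# Same placeholders as before.
URL = "https://cluster-mgmt.example.com/servlets/netapp.servlets.admin.XMLrequest_filer"
AUTH = ("monitor", "secret")
ENVELOPE = ('<?xml version="1.0" encoding="UTF-8"?>'
            '<netapp version="1.31" xmlns="http://www.netapp.com/filer/admin">'
            '<{0}/></netapp>')

# Fire a few no-argument calls and report which ones fail cluster-wide.
for api in ("system-get-version", "diagnosis-alert-get-iter"):
    resp = requests.post(URL, data=ENVELOPE.format(api), auth=AUTH,
                         headers={"Content-Type": "text/xml"},
                         verify=False, timeout=60)
    print(api, "-", "fails" if 'status="failed"' in resp.text else "works")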
Cheers, Vladimir
Yes, the cluster interface is SSH-able; I can log in with no issues.
On Wed, Mar 30, 2016 at 2:59 PM, tmac <tmacmd@gmail.com> wrote:
Did the cluster interface fail over correctly? Can the host doing the API calls "ssh" into the cluster address?