We use a 6030 series with software disk ownership. According to
the documentation below, this option does not apply to us. Is that right?
Please note that this was an intermittent Loop A failure, not
a complete failure: Loop A kept going up and down, and we had no failover during
this period.
Suresh
From: Coatney, Sue [mailto:Sue.Coatney@netapp.com]
Sent: Friday, January 22, 2010 2:02 PM
To: LOhit; Suresh Rajagopalan
Cc: toasters@mathworks.com
Subject: RE: Loop A failure not triggering failover
The cf.takeover.on_disk_shelf_miscompare option needs to be turned
on for takeover to happen when a disk shelf mis-compare happens.
Sue Coatney
High Availability Team
NetApp
From: LOhit [mailto:lohit.b@gmail.com]
Sent: Thu 1/21/2010 11:09 PM
To: Suresh Rajagopalan
Cc: toasters@mathworks.com
Subject: Re: Loop A failure not triggering failover
Hi Suresh,
I think a takeover should have happened when the loop failed. (The following
is taken from the ONTAP docs.)
Describes the way a node uses disk shelf comparison with its partner node to
determine if it is impaired.
When communication between nodes is first established through the cluster
interconnect adapters, the nodes exchange a list of disk shelves that are
visible on the A and B loops of each node. If, later, a system sees that the B
loop disk shelf count on its partner is greater than its local A loop disk
shelf count, the system concludes that it is impaired and prompts its partner
to initiate a takeover.
Note:
Disk shelf comparison does not function for active/active configurations using
software-based disk ownership, or for fabric-attached MetroClusters.
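The comparison rule quoted above can be sketched roughly as follows. This is only an illustration of the documented rule, not ONTAP's actual implementation; the names ShelfCounts, is_impaired and should_request_takeover are hypothetical, and the two flags simply mirror the Note about configurations where the comparison does not function.

```python
from dataclasses import dataclass


@dataclass
class ShelfCounts:
    """Number of disk shelves a node sees on each loop (hypothetical type)."""
    loop_a: int
    loop_b: int


def is_impaired(local: ShelfCounts, partner: ShelfCounts) -> bool:
    """Rule from the quoted docs: the node is impaired if the partner's
    B loop shelf count is greater than the local A loop shelf count."""
    return partner.loop_b > local.loop_a


def should_request_takeover(local: ShelfCounts, partner: ShelfCounts,
                            software_ownership: bool = False,
                            fabric_metrocluster: bool = False) -> bool:
    """Apply the rule only where the docs say shelf comparison functions."""
    # Per the Note: no shelf comparison with software-based disk ownership
    # or on fabric-attached MetroClusters.
    if software_ownership or fabric_metrocluster:
        return False
    return is_impaired(local, partner)
```

With two shelves per loop and one shelf lost on the local A loop, should_request_takeover(ShelfCounts(1, 2), ShelfCounts(2, 2)) returns True; passing software_ownership=True makes it return False.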
options cf.takeover.detection.seconds number_of_seconds
(But I think this affects only cluster interconnect timeouts, not the loop
failure.)
The valid values for number_of_seconds are 10 through 180; the default is 15.
Attention: If the specified time is less than 15 seconds, unnecessary
takeovers can occur, and a core might not be generated for some system panics.
Use caution when assigning a takeover time of less than 15 seconds.
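As a small illustration of the limits quoted above, a value could be checked before it is applied. This is a sketch only; check_detection_seconds is a hypothetical helper, not part of any NetApp tool, and it encodes nothing beyond the documented range (10 through 180) and the 15-second caution.

```python
MIN_SECONDS = 10      # lowest valid value per the quoted docs
MAX_SECONDS = 180     # highest valid value per the quoted docs
DEFAULT_SECONDS = 15  # documented default; lower values need caution


def check_detection_seconds(value: int) -> str:
    """Return "ok" or "warning" for a valid value; raise if out of range."""
    if not MIN_SECONDS <= value <= MAX_SECONDS:
        raise ValueError(
            f"number_of_seconds must be {MIN_SECONDS} through {MAX_SECONDS}, got {value}"
        )
    if value < DEFAULT_SECONDS:
        # Docs: unnecessary takeovers can occur and a core might not be
        # generated for some system panics.
        return "warning"
    return "ok"
```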
On Fri, Jan 22, 2010 at 11:55 AM, Suresh Rajagopalan <SRajagopalan@williamoneil.com>
wrote:
We have an active/active setup on our
filers, with standard loop A/loop B cabling (no multipath HA).
We had a recent event with our filers where an intermittent failure of loop A
did not trigger a failover to the partner. I’d like to know why that is the
case. According to the NetApp failover cause and effect document at
This event should have caused a failover.
The log message from the filer on loop A was:
Sun Jan 17 15:41:56 PST [netapp1: fci.link.break:error]: Link break detected on Fibre Channel adapter 0e.
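To gauge how often a loop is flapping, messages of this kind could be counted per filer and adapter. The following is a rough sketch that assumes each message sits on a single line in the format shown above; count_link_breaks and the regular expression are hypothetical helpers, not NetApp tooling.

```python
import re
from collections import Counter

# Matches messages like:
# [netapp1: fci.link.break:error]: Link break detected on Fibre Channel adapter 0e.
LINK_BREAK = re.compile(
    r"\[(?P<host>[^:\]]+):\s*fci\.link\.break:error\]:\s*"
    r"Link break detected on Fibre Channel adapter (?P<adapter>\w+)"
)


def count_link_breaks(lines):
    """Count link-break messages per (host, adapter) pair."""
    counts = Counter()
    for line in lines:
        match = LINK_BREAK.search(line)
        if match:
            counts[(match.group("host"), match.group("adapter"))] += 1
    return counts
```

A high count for one adapter over a short period would be consistent with the loop going up and down rather than failing outright.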
Is there an option or timeout
setting to make the failover happen?
Thanks
Suresh
--
LOhit