Thanks for the help. I also sent this privately.
Sorry for the delay in getting back to you.
When a manual cf takeover is done, it fails over fine. When power is removed, it errors out and does not take over. A cf forcetakeover at the command prompt will then allow it to take over. We have tried this on 6 separate F7XX clusters with the same result. The only fix we have found is to revert to 6.4.5, after which it works again. From there we are able to go to 6.5.1 and it still works, then to 6.5.7 again and it is still working. We also checked all options to make sure they are the same on both filers. Here is the log with the errors.
main2> cf status
Cluster enabled, main1 is up.
Interconnect is up.
main1> cf status
Cluster enabled, main2 is up.
Interconnect is up
main1> Sun Dec 9 00:08:22 EST [main1: raid.rg.reparity.done:notice]: /vol0/plex0/Reg0: parity recomputation completed in 8:01.48
Sun Dec 9 00:08:54 EST [main1: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of main1 by main2 disabled (unsynchronized log)
Sun Dec 9 00:08:55 EST [main1: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of main1 by main2 disabled (interconnect error)
Sun Dec 9 00:09:02 EST [main1: cf.fsm.partnerNotResponding:notice]: Cluster monitor: partner not responding
Sun Dec 9 00:09:03 EST [main1: cf.fsm.takeoverCountdown:warning]: Cluster monitor: takeover scheduled in 10 seconds
Sun Dec 9 00:09:13 EST [main1: cf.fsm.firmwareExpiry:info]: Cluster monitor: firmware timeout expired on partner
Sun Dec 9 00:09:13 EST [main1: cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Sun Dec 9 00:09:13 EST [main1: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Sun Dec 9 00:09:19 EST [main1: raid.vol.replay.nvram:info]: Performing raid replay on volume(s)
Sun Dec 9 00:09:19 EST [main1: raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Sun Dec 9 00:09:19 EST [main1: raid.stripe.replay.summary:info]: Replayed 0 stripes.
Sun Dec 9 00:09:19 EST [main1: raid.fm.replayFail:error]: RAID takeover: raid replay failed with status 16
Sun Dec 9 00:09:19 EST [main1: raid.fm.takeoverFail:error]: RAID takeover failed: mirror consistency is required.
Sun Dec 9 00:09:19 EST [main1: cf.rsrc.takeoverFail:ALERT]: Cluster monitor: takeover during raid_replay failed; takeover cancelled
Sun Dec 9 00:09:19 EST [main1: cf.fm.takeoverFailed:error]: Cluster monitor: takeover failed 'unable to start partner'
Sun Dec 9 00:09:19 EST [main1: cf.fm.givebackStarted:warning]: Cluster monitor: giveback started
Sun Dec 9 00:09:20 EST [main1: cf.fm.givebackComplete:warning]: Cluster monitor: giveback completed
Sun Dec 9 00:09:20 EST [main1: cf.fsm.stateTransit:warning]: Cluster monitor: TAKEOVER --> UP
Sun Dec 9 00:09:20 EST [main1: cf.fsm.takeoverByPartnerDisabled:notice]: Cluster monitor: takeover of main1 by main2 disabled (interconnect error)
Sun Dec 9 00:09:21 EST [main1: asup.smtp.sent:notice]: Cluster Notification mail sent: Cluster Notification from main1 (CLUSTER TAKEOVER FAILED) CRITICAL
Sun Dec 9 00:09:30 EST [main1: cf.fsm.partnerNotResponding:notice]: Cluster monitor: partner not responding
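For anyone comparing console logs across versions, the failing steps can be pulled out of a log like the one above with a short script. This is only a rough sketch: the regular expression is an assumption about the line format, inferred from the lines pasted here and not from any NetApp documentation, and the names LOG_LINE, parse_line and failures are made up for illustration.

```python
import re

# Assumed line shape, inferred only from the log lines in this message:
#   [<prompt>> ]<timestamp> [<host>: <event.name>:<severity>]: <message>
LOG_LINE = re.compile(
    r"^(?:\S+> )?(?P<ts>.+?) "
    r"\[(?P<host>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]: "
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_LINE.match(line.strip())
    return m.groupdict() if m else None


def failures(lines):
    """Keep only the records logged at error or ALERT severity."""
    wanted = {"error", "alert"}
    records = []
    for line in lines:
        rec = parse_line(line)
        if rec and rec["severity"].lower() in wanted:
            records.append(rec)
    return records


# Three lines copied from the log above.
sample = [
    "Sun Dec 9 00:09:13 EST [main1: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started",
    "Sun Dec 9 00:09:19 EST [main1: raid.fm.replayFail:error]: RAID takeover: raid replay failed with status 16",
    "Sun Dec 9 00:09:19 EST [main1: cf.rsrc.takeoverFail:ALERT]: Cluster monitor: takeover during raid_replay failed; takeover cancelled",
]

for rec in failures(sample):
    print(rec["ts"], rec["event"], rec["severity"])
```

Running the same filter over logs captured on 6.4.5, 6.5.1 and 6.5.7 would make it easier to see which event first differs between a working and a failing takeover.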
*disclaimer*
Berkcom.net does not have any personal or business relationship to berkcom.com or Berkeley Communications.
-------- Original Message --------
Subject: RE: Cluster bug
From: "Watanabe, Steve" <steve.watanabe@netapp.com>
Date: Tue, January 15, 2008 10:44 am
To: <admin@berkcom.net>, "NetApp list" <toasters@mathworks.com>
Hi Jack,
I sent you a message last Friday asking for more information in order to see if I could help resolve your issue. As I haven't heard back, I assume you've resolved the problem. I would be interested in hearing the details and how we could make the experience better. I am responding publicly because I wanted others on the list to know that your request wasn't being ignored, especially since you feel you've been left hanging.
If you're still experiencing problems, please gather console logs and I'll take a look.
Steve Watanabe
HA Infrastructure Group
I think I found a bug in 6.5.7 on the F7XX series filer. It has to do with clustering. When a manual CF takeover or giveback is done, both filers fail over to each other fine. When a true head unit failure happens (power removed), the filers won't fail over to each other. We have spent days troubleshooting this issue, and the only way to get the cluster to work was to boot the same hardware and configuration with 6.4.5. We are now testing all versions of the 6.5 family to see if the problem exists in all of them. Has anyone out there experienced this issue? Is anyone willing to test on their own systems?
Yes, I know 6.5.7 is no longer supported, and yes, I know the F7XX series is no longer supported. I just wish NetApp would not break working code and then leave you hanging. I understand having to move on, but to break code that has worked for YEARS and then not fix it is a little disheartening. Is anyone from NetApp willing to test there?
Jack