How long should it take a filer to fail over from one head to the other? When I force a failover (cf forcetakeover) from one head, the other goes down for minutes. Here's what I see on the console. This is a new filer with very little traffic going to it, and FC is not even set up yet; it's all NFS/CIFS.
array01> cf forcetakeover
cf forcetakeover may lead to data corruption; really force a takeover? yes
cf: forcetakeover initiated by operator
array01> Mon May 5 12:27:48 EST [array01: cf.misc.operatorForcedTakeover:warning]: Cluster monitor: forced takeover initiated by operator
Mon May 5 12:27:48 EST [array01: cf.fsm.takeover.forced:info]: Cluster monitor: takeover attempted after cf forcetakeover command
Mon May 5 12:27:48 EST [array01: cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Mon May 5 12:27:48 EST [array01: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Mon May 5 12:27:48 EST [array02/array01: coredump.spare.none:info]: No sparecore disk was found.
Mon May 5 12:27:51 EST [array01: raid.vol.replay.nvram:info]: Performing raid replay on volume(s)
Mon May 5 12:27:51 EST [array01: raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Mon May 5 12:27:51 EST [array01: raid.stripe.replay.summary:info]: Replayed 0 stripes.
Mon May 5 12:27:54 EST [array02/array01: wafl.replay.done:info]: WAFL log replay completed, 2 seconds
ifconfig: no such media type <xxx>
media type options are: <tp> <tp-fd> <100tx> <100tx-fd> <1000fx> <auto> <10g-sr>
ifconfig: Unable to determine primary for interface e0a.
ifconfig: e0a: no such interface
ifconfig: Unable to determine primary for interface e0b.
ifconfig: e0b: no such interface
ifconfig: Unable to determine primary for interface e0c.
ifconfig: e0c: no such interface
ifconfig: Unable to determine primary for interface e0d.
ifconfig: e0d: no such interface
ifconfig: Unable to determine primary for interface e2a.
ifconfig: e2a: no such interface
ifconfig: Unable to determine primary for interface e2b.
ifconfig: e2b: no such interface
add net default: gateway 10.28.17.1: network unreachable
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0a.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0b.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0c.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0d.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e2a.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e2b.
Mon May 5 12:27:55 EST [array02/array01: nis.servers.not.available:error]: NIS server(s) not available.
Mon May 5 12:27:55 EST [array02/array01: cf_takeover:info]: relog syslog Mon May 5 12:26:00 EST [array02: monitor.globalStatus.ok:info]: The system's global status is normal.
Mon May 5 12:27:55 EST [array02/array01: cf_takeover:info]: relog syslog Mon May 5 12:27:47 EST [array02: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of array
There are 68 spare disks; you may want to use the vol or aggr command
to create new volumes or aggregates or add disks to the existing aggregate.
FCP service stopped.
Mon May 5 12:27:55 EST [array01: net.ifconfig.takeoverError:warning]: WARNING: 6 errors detected during network takeover processing WARNING: Some network clients may not be able to access the cluster during takeover
Mon May 5 12:27:55 EST [array01: cf.rsrc.takeoverOpFail:error]: Cluster monitor: takeover during ifconfig_2 failed; takeover continuing...
CIFS partner server is running.
Mon May 5 12:27:55 EST [array01 (takeover): cf.rsrc.transitTime:notice]: Top Takeover transit times wafl_replay=2383 {replay_log=2353, mark_replaying=29}, raid=832, rc=410 {hostname=51, ifconfig=46, options=23, options=14, options=10, options=9, ifconfig=1, ifconfig=1, ifconfig=1, route=1}, wafl=405, registry_postrc_phase1=227, raid_replay=179, registry_prerc=115, wafl_sync=74, fmdisk_reserve=70, cifs=70
Mon May 5 12:27:55 EST [array01 (takeover): cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Mon May 5 12:27:55 EST [array01 (takeover): cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time is 7 seconds
Mon May 5 12:27:58 EST [array02/array01: asup.smtp.host:info]: Autosupport cannot connect to host smtp.danahermail.com (Network comm problem) for message: REBOOT (CLUSTER TAKEOVER)
Mon May 5 12:27:58 EST [array02/array01: asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts from the autosupport.mailhost option. (REBOOT (CLUSTER TAKEOVER))
Mon May 5 12:28:00 EST [array01 (takeover): monitor.globalStatus.critical:CRITICAL]: This node has taken over array02.
Mon May 5 12:28:00 EST [array02/array01: monitor.globalStatus.critical:CRITICAL]: array01 has taken over this node.
Mon May 5 12:28:05 EST [array02/array01: nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations have completed for the partner server.
Mon May 5 12:28:07 EST [array01 (takeover): asup.post.sent:notice]: Cluster Notification message posted to IBM: Cluster Notification from array01 (CLUSTER TAKEOVER COMPLETE MANUAL) INFO
Mon May 5 12:32:07 EST [array02/array01: asup.smtp.host:info]: Autosupport cannot connect to host smtp.danahermail.com (Network comm problem) for message: REBOOT (CLUSTER TAKEOVER)
Mon May 5 12:32:07 EST [array02/array01: asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts from the autosupport.mailhost option. (REBOOT (CLUSTER TAKEOVER))
Here is the ifconfig -a from the node that stayed up:
array01(takeover)> ifconfig -a
e0a: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:08:22:b7 (auto-1000t-fd-up) flowcontrol full
trunked lan0
e0b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:98:08:22:b6 (auto-unknown-cfg_down) flowcontrol full
e0c: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:08:22:b7 (auto-1000t-fd-up) flowcontrol full
trunked lan0
e0d: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:98:08:22:b4 (auto-unknown-cfg_down) flowcontrol full
e2a: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:07:43:05:16:98 (auto-10g_sr-fd-cfg_down) flowcontrol full
e2b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:07:43:05:16:99 (auto-10g_sr-fd-cfg_down) flowcontrol full
lo: flags=19e8049<UP,LOOPBACK,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 8160
inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
ether 00:00:00:00:00:00 (VIA Provider)
lan0: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
inet 10.28.17.213 netmask 0xffffff00 broadcast 10.28.17.255
partner lan0 (not in use)
ether 02:a0:98:08:22:b7 (Enabled virtual interface)
This message (including any attachments) contains confidential and/or proprietary information intended only for the addressee. Any unauthorized disclosure, copying, distribution or reliance on the contents of this information is strictly prohibited and may constitute a violation of law. If you are not the intended recipient, please notify the sender immediately by responding to this e-mail, and delete the message from your system. If you have any questions about this e-mail please notify the sender immediately.
First, is there a reason you're using forcetakeover rather than just cf takeover? The latter should do the trick. Second, in my experience, takeover is much quicker than that: certainly under 2 minutes, and in many cases under 1. (Giveback usually takes longer than takeover.)
But it looks like you don't have your IP takeovers set up properly, which could be causing delays. In your current config, I'd be surprised if any of your hosts could see the downed filer.
-- Adam Fox adamfox@netapp.com
________________________________
From: Page, Jeremy [mailto:jeremy.page@gilbarco.com] Sent: Monday, May 05, 2008 1:37 PM To: toasters@mathworks.com Subject: filer fail over times
Hey Jeremy,
On Mon, May 5, 2008 at 7:36 PM, Page, Jeremy jeremy.page@gilbarco.com wrote:
How long should it take a filer to fail over from one head to another? When I force a failover (cf forcetakeover) from one head the other goes down for minutes. Here's what I see on the console. This is a new filer with very little traffic going to it and FC is not even set up yet, all NFS/CIFS.
It should take seconds; how many depends on the configuration and load of the filers at the moment of takeover.
array01> cf forcetakeover
cf forcetakeover may lead to data corruption; really force a takeover? yes
cf: forcetakeover initiated by operator
array01> Mon May 5 12:27:48 EST [array01: cf.misc.operatorForcedTakeover:warning]: Cluster monitor: forced takeover initiated by operator
Mon May 5 12:27:48 EST [array01: cf.fsm.takeover.forced:info]: Cluster monitor: takeover attempted after cf forcetakeover command
Mon May 5 12:27:48 EST [array01: cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Mon May 5 12:27:48 EST [array01: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Mon May 5 12:27:48 EST [array02/array01: coredump.spare.none:info]: No sparecore disk was found.
Mon May 5 12:27:51 EST [array01: raid.vol.replay.nvram:info]: Performing raid replay on volume(s)
Mon May 5 12:27:51 EST [array01: raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Mon May 5 12:27:51 EST [array01: raid.stripe.replay.summary:info]: Replayed 0 stripes.
Mon May 5 12:27:54 EST [array02/array01: wafl.replay.done:info]: WAFL log replay completed, 2 seconds
At this point the takeover has progressed far enough that both filers are running on one piece of hardware. Now the interface configuration begins.
ifconfig: no such media type <xxx>
media type options are: <tp> <tp-fd> <100tx> <100tx-fd> <1000fx> <auto> <10g-sr>
ifconfig: Unable to determine primary for interface e0a.
ifconfig: e0a: no such interface
ifconfig: Unable to determine primary for interface e0b.
ifconfig: e0b: no such interface
ifconfig: Unable to determine primary for interface e0c.
ifconfig: e0c: no such interface
ifconfig: Unable to determine primary for interface e0d.
ifconfig: e0d: no such interface
ifconfig: Unable to determine primary for interface e2a.
ifconfig: e2a: no such interface
ifconfig: Unable to determine primary for interface e2b.
ifconfig: e2b: no such interface
add net default: gateway 10.28.17.1: network unreachable
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0a.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0b.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0c.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0d.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e2a.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e2b.
It would seem that your cluster setup for network interfaces isn't quite correct. Check your /etc/rc files and use the cluster config checker from the toolkit to identify problems.
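One quick way to see which interfaces the partner's /etc/rc is tripping over is to pull them out of the net.ifconfig.noLocal messages. Here is a small Python sketch that uses three of the console lines quoted above as sample input. The helper name failed_interfaces is made up for illustration, and reading the "array02/array01" prefix as "array02's identity running on array01" is my assumption.

```python
import re

# Three of the console lines quoted above, used as sample input.
LOG = """\
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0a.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e2b.
Mon May 5 12:27:55 EST [array02/array01: nis.servers.not.available:error]: NIS server(s) not available.
"""

# Capture the node label inside the brackets and the interface name
# at the end of a net.ifconfig.noLocal error message.
PATTERN = re.compile(
    r"\[(?P<node>[^:\]]+): net\.ifconfig\.noLocal:error\]: "
    r".*interface (?P<ifname>\w+)\."
)

def failed_interfaces(log_text):
    """Map each node label to the interfaces named in noLocal errors."""
    result = {}
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            result.setdefault(match["node"], []).append(match["ifname"])
    return result

failures = failed_interfaces(LOG)
print(failures)
```

Every interface that shows up here is one that the partner's rc file tries to configure but that has no counterpart set up on the surviving head.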
Mon May 5 12:27:55 EST [array02/array01: nis.servers.not.available:error]: NIS server(s) not available.
Mon May 5 12:27:55 EST [array02/array01: cf_takeover:info]: relog syslog Mon May 5 12:26:00 EST [array02: monitor.globalStatus.ok:info]: The system's global status is normal.
Mon May 5 12:27:55 EST [array02/array01: cf_takeover:info]: relog syslog Mon May 5 12:27:47 EST [array02: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of array
There are 68 spare disks; you may want to use the vol or aggr command
to create new volumes or aggregates or add disks to the existing aggregate.
FCP service stopped.
Mon May 5 12:27:55 EST [array01: net.ifconfig.takeoverError:warning]: WARNING: 6 errors detected during network takeover processing WARNING: Some network clients may not be able to access the cluster during takeover
Mon May 5 12:27:55 EST [array01: cf.rsrc.takeoverOpFail:error]: Cluster monitor: takeover during ifconfig_2 failed; takeover continuing...
CIFS partner server is running.
Mon May 5 12:27:55 EST [array01 (takeover): cf.rsrc.transitTime:notice]: Top Takeover transit times wafl_replay=2383 {replay_log=2353, mark_replaying=29}, raid=832, rc=410 {hostname=51, ifconfig=46, options=23, options=14, options=10, options=9, ifconfig=1, ifconfig=1, ifconfig=1, route=1}, wafl=405, registry_postrc_phase1=227, raid_replay=179, registry_prerc=115, wafl_sync=74, fmdisk_reserve=70, cifs=70
Mon May 5 12:27:55 EST [array01 (takeover): cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Mon May 5 12:27:55 EST [array01 (takeover): cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time is 7 seconds
Your filers say it's 7 seconds for takeover to complete. Sounds fair!
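To see where those seconds go, the cf.rsrc.transitTime message quoted above can be broken down per phase. A minimal Python sketch follows. It uses a shortened copy of that message, and the function name top_phases is made up for illustration. The values look like milliseconds (wafl_replay=2383 sits next to "WAFL log replay completed, 2 seconds"), but that unit is my assumption, not something the log states.

```python
import re

# Shortened copy of the cf.rsrc.transitTime message from the console log
# (some of the rc sub-steps are left out for brevity).
LINE = ("Top Takeover transit times wafl_replay=2383 {replay_log=2353, "
        "mark_replaying=29}, raid=832, rc=410 {hostname=51, ifconfig=46, "
        "options=23, route=1}, wafl=405, registry_postrc_phase1=227, "
        "raid_replay=179, registry_prerc=115, wafl_sync=74, "
        "fmdisk_reserve=70, cifs=70")

def top_phases(line):
    """Return (phase, value) pairs for the top-level phases, largest first."""
    # Drop the {...} groups: they only break a phase down into sub-steps.
    flat = re.sub(r"\{[^}]*\}", "", line)
    pairs = re.findall(r"(\w+)=(\d+)", flat)
    return sorted(((name, int(value)) for name, value in pairs),
                  key=lambda pair: pair[1], reverse=True)

phases = top_phases(LINE)
for name, value in phases:
    print(f"{name:24s} {value:6d}")
```

In this log, WAFL replay is the largest single item and the rc phase (where the ifconfig errors occur) is comparatively small, which fits the reported 7-second takeover.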
Mon May 5 12:27:58 EST [array02/array01: asup.smtp.host:info]: Autosupport cannot connect to host smtp.danahermail.com (Network comm problem) for message: REBOOT (CLUSTER TAKEOVER)
You seem to have some network communications problem after the takeover ... but I mentioned that before :)
Mon May 5 12:27:58 EST [array02/array01: asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts from the autosupport.mailhost option. (REBOOT (CLUSTER TAKEOVER))
Mon May 5 12:28:00 EST [array01 (takeover): monitor.globalStatus.critical:CRITICAL]: This node has taken over array02.
Mon May 5 12:28:00 EST [array02/array01: monitor.globalStatus.critical:CRITICAL]: array01 has taken over this node.
Mon May 5 12:28:05 EST [array02/array01: nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations have completed for the partner server.
Mon May 5 12:28:07 EST [array01 (takeover): asup.post.sent:notice]: Cluster Notification message posted to IBM: Cluster Notification from array01 (CLUSTER TAKEOVER COMPLETE MANUAL) INFO
Mon May 5 12:32:07 EST [array02/array01: asup.smtp.host:info]: Autosupport cannot connect to host smtp.danahermail.com (Network comm problem) for message: REBOOT (CLUSTER TAKEOVER)
And here we can clearly see that array01 can connect to IBM; however, array02 cannot connect to deliver its ASUPs.
Mon May 5 12:32:07 EST [array02/array01: asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts from the autosupport.mailhost option. (REBOOT (CLUSTER TAKEOVER))
Here is the ifconfig -a from the node that stayed up:
array01(takeover)> ifconfig -a
e0a: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:08:22:b7 (auto-1000t-fd-up) flowcontrol full trunked lan0
e0b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:98:08:22:b6 (auto-unknown-cfg_down) flowcontrol full
e0c: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:08:22:b7 (auto-1000t-fd-up) flowcontrol full trunked lan0
e0d: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:98:08:22:b4 (auto-unknown-cfg_down) flowcontrol full
e2a: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:07:43:05:16:98 (auto-10g_sr-fd-cfg_down) flowcontrol full
e2b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:07:43:05:16:99 (auto-10g_sr-fd-cfg_down) flowcontrol full
lo: flags=19e8049<UP,LOOPBACK,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 8160
inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
ether 00:00:00:00:00:00 (VIA Provider)
lan0: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
inet 10.28.17.213 netmask 0xffffff00 broadcast 10.28.17.255
partner lan0 (not in use)
ether 02:a0:98:08:22:b7 (Enabled virtual interface)
If you have trunked interfaces, there is no need to ifconfig all the separate interfaces; they inherit their settings from the trunk when you configure it.
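As a rough illustration of that point, here is a Python sketch that scans rc-style text for ifconfig lines aimed at an interface that is already a member of a vif. The rc contents below are hypothetical (I have not seen the poster's files), the function name redundant_ifconfig_lines is made up, and the parser assumes the usual "vif create <type> <name> [-b <method>] <interfaces>" layout.

```python
# Hypothetical /etc/rc contents, NOT taken from the poster's filers.
RC_TEXT = """\
hostname array02
vif create multi lan0 e0a e0c
ifconfig e0a mediatype auto
ifconfig lan0 `hostname`-lan0 netmask 255.255.255.0 partner lan0
route add default 10.28.17.1 1
"""

def redundant_ifconfig_lines(rc_text):
    """Return rc lines that run ifconfig on a member of a vif."""
    members = set()
    flagged = []
    for raw in rc_text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        if words[:2] == ["vif", "create"] and len(words) > 3:
            # words[2] is the trunk type and words[3] the vif name;
            # the remaining words are options and member interfaces.
            rest = iter(words[4:])
            for word in rest:
                if word == "-b":
                    next(rest, None)  # skip the load-balancing method
                else:
                    members.add(word)
        elif words[0] == "ifconfig" and len(words) > 1 and words[1] in members:
            flagged.append(raw.strip())
    return flagged

flagged_lines = redundant_ifconfig_lines(RC_TEXT)
print(flagged_lines)
```

With the sample text, only the "ifconfig e0a" line is flagged; the "ifconfig lan0 ... partner lan0" line is the one that carries the address and the partner mapping.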
HTH & HAND !
post both of your /etc/rc's
sounds like they are waaay outta wack
--tmac
We recently got in a bunch of new FAS6000 and FAS3000 series clusters, and CFO seems to take less than 20 seconds in testing, with only minimal load on them so far. We've seen between 4 and 12 seconds for CF takeover, and between 1 and 16 seconds for CF giveback.
Using loopback adapters for 'open' FC ports helps as well.
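For comparing numbers like these against a captured console log, here is a minimal Python sketch that computes the takeover time from the cf.fm.takeoverStarted and cf.fm.takeoverComplete events. The sample lines are copied from the console output in this thread; the function names, the regular expression, and the year default (the console timestamps carry no year) are my own assumptions, not anything Data ONTAP provides.

```python
import re
from datetime import datetime

# Sample lines copied from the takeover console output in this thread.
LOG = """\
Mon May 5 12:27:48 EST [array01: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Mon May 5 12:27:55 EST [array01 (takeover): cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Mon May 5 12:27:55 EST [array01 (takeover): cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time is 7 seconds
"""

# Assumed line shape: "<dow> <mon> <day> <hh:mm:ss> <tz> [<node>: <event>:<severity>]: <text>"
LINE_RE = re.compile(
    r"\w{3} (?P<mon>\w{3}) +(?P<day>\d{1,2}) (?P<time>\d\d:\d\d:\d\d) \w+ "
    r"\[[^:\]]+: (?P<event>[\w.]+):\w+\]"
)


def parse_events(log_text, year=2008):
    """Map each event name to the first timestamp at which it appears."""
    events = {}
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # prompt lines, ifconfig noise, etc.
        stamp = datetime.strptime(
            f"{year} {m['mon']} {m['day']} {m['time']}", "%Y %b %d %H:%M:%S"
        )
        events.setdefault(m["event"], stamp)
    return events


def takeover_seconds(log_text, year=2008):
    """Seconds between takeover start and completion, or None if either is missing."""
    events = parse_events(log_text, year)
    start = events.get("cf.fm.takeoverStarted")
    done = events.get("cf.fm.takeoverComplete")
    if start is None or done is None:
        return None
    return int((done - start).total_seconds())


print(takeover_seconds(LOG))
```

On the sample lines this agrees with the 7 seconds the filer itself reports in cf.fm.takeoverDuration, which falls inside the 4 to 12 second range mentioned above.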
________________________________
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Page, Jeremy
Sent: Monday, May 05, 2008 1:37 PM
To: toasters@mathworks.com
Subject: filer fail over times