filer fail over times - toasters

5 May 2008


How long should it take a filer to fail over from one head to the other?
When I force a takeover (cf forcetakeover) from one head, the other one
goes down for minutes. Here's what I see on the console. This is a new
filer with very little traffic going to it. FC is not even set up yet,
so it is all NFS/CIFS.
array01> cf forcetakeover
cf forcetakeover may lead to data corruption; really force a takeover?
yes
cf: forcetakeover initiated by operator
array01> Mon May  5 12:27:48 EST [array01:
cf.misc.operatorForcedTakeover:warning]: Cluster monitor: forced
takeover initiated by operator
Mon May  5 12:27:48 EST [array01: cf.fsm.takeover.forced:info]: Cluster
monitor: takeover attempted after cf forcetakeover command
Mon May  5 12:27:48 EST [array01: cf.fsm.stateTransit:warning]: Cluster
monitor: UP --> TAKEOVER
Mon May  5 12:27:48 EST [array01: cf.fm.takeoverStarted:warning]:
Cluster monitor: takeover started
Mon May  5 12:27:48 EST [array02/array01: coredump.spare.none:info]: No
sparecore disk was found.
Mon May  5 12:27:51 EST [array01: raid.vol.replay.nvram:info]:
Performing raid replay on volume(s)
Mon May  5 12:27:51 EST [array01: raid.cksum.replay.summary:info]:
Replayed 0 checksum blocks.
Mon May  5 12:27:51 EST [array01: raid.stripe.replay.summary:info]:
Replayed 0 stripes.
Mon May  5 12:27:54 EST [array02/array01: wafl.replay.done:info]: WAFL
log replay completed, 2 seconds
ifconfig: no such media type <xxx>
media type options are: <tp> <tp-fd> <100tx> <100tx-fd>
<1000fx> <auto> <10g-sr>
ifconfig: Unable to determine primary for interface e0a.
ifconfig: e0a: no such interface
ifconfig: Unable to determine primary for interface e0b.
ifconfig: e0b: no such interface
ifconfig: Unable to determine primary for interface e0c.
ifconfig: e0c: no such interface
ifconfig: Unable to determine primary for interface e0d.
ifconfig: e0d: no such interface
ifconfig: Unable to determine primary for interface e2a.
ifconfig: e2a: no such interface
ifconfig: Unable to determine primary for interface e2b.
ifconfig: e2b: no such interface
add net default: gateway 10.28.17.1: network unreachable
Mon May  5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]:
ifconfig: Unable to determine primary for interface e0a.
Mon May  5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]:
ifconfig: Unable to determine primary for interface e0b.
Mon May  5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]:
ifconfig: Unable to determine primary for interface e0c.
Mon May  5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]:
ifconfig: Unable to determine primary for interface e0d.
Mon May  5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]:
ifconfig: Unable to determine primary for interface e2a.
Mon May  5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]:
ifconfig: Unable to determine primary for interface e2b.
Mon May  5 12:27:55 EST [array02/array01:
nis.servers.not.available:error]: NIS server(s) not available.
Mon May  5 12:27:55 EST [array02/array01: cf_takeover:info]: relog
syslog Mon May  5 12:26:00 EST [array02: monitor.globalStatus.ok:info]:
The system's global status is normal.
Mon May  5 12:27:55 EST [array02/array01: cf_takeover:info]: relog
syslog Mon May  5 12:27:47 EST [array02:
cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of
array
There are 68 spare disks; you may want to use the vol or aggr command
to create new volumes or aggregates or add disks to the existing
aggregate.
FCP service stopped.
Mon May  5 12:27:55 EST [array01: net.ifconfig.takeoverError:warning]:
WARNING: 6 errors detected during network takeover processing WARNING:
Some network clients may not be able to access the cluster during
takeover
Mon May  5 12:27:55 EST [array01: cf.rsrc.takeoverOpFail:error]: Cluster
monitor: takeover during ifconfig_2 failed; takeover continuing...
CIFS partner server is running.
Mon May  5 12:27:55 EST [array01 (takeover):
cf.rsrc.transitTime:notice]: Top Takeover transit times wafl_replay=2383
{replay_log=2353, mark_replaying=29}, raid=832, rc=410 {hostname=51,
ifconfig=46, options=23, options=14, options=10, options=9, ifconfig=1,
ifconfig=1, ifconfig=1, route=1}, wafl=405, registry_postrc_phase1=227,
raid_replay=179, registry_prerc=115, wafl_sync=74, fmdisk_reserve=70,
cifs=70
Mon May  5 12:27:55 EST [array01 (takeover):
cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Mon May  5 12:27:55 EST [array01 (takeover):
cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time
is 7 seconds
Mon May  5 12:27:58 EST [array02/array01: asup.smtp.host:info]:
Autosupport cannot connect to host smtp.danahermail.com (Network comm
problem) for message: REBOOT (CLUSTER TAKEOVER)
Mon May  5 12:27:58 EST [array02/array01: asup.smtp.unreach:error]:
Autosupport mail was not sent because the system cannot reach any of the
mail hosts from the autosupport.mailhost option. (REBOOT (CLUSTER
TAKEOVER))
Mon May  5 12:28:00 EST [array01 (takeover):
monitor.globalStatus.critical:CRITICAL]: This node has taken over
array02.
Mon May  5 12:28:00 EST [array02/array01:
monitor.globalStatus.critical:CRITICAL]: array01 has taken over this
node.
Mon May  5 12:28:05 EST [array02/array01:
nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations
have completed for the partner server.
Mon May  5 12:28:07 EST [array01 (takeover): asup.post.sent:notice]:
Cluster Notification message posted to IBM: Cluster Notification from
array01 (CLUSTER TAKEOVER COMPLETE MANUAL) INFO
Mon May  5 12:32:07 EST [array02/array01: asup.smtp.host:info]:
Autosupport cannot connect to host smtp.danahermail.com (Network comm
problem) for message: REBOOT (CLUSTER TAKEOVER)
Mon May  5 12:32:07 EST [array02/array01: asup.smtp.unreach:error]:
Autosupport mail was not sent because the system cannot reach any of the
mail hosts from the autosupport.mailhost option. (REBOOT (CLUSTER
TAKEOVER))
Here is the ifconfig -a output from the node that stayed up:
array01(takeover)> ifconfig -a
e0a: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:08:22:b7 (auto-1000t-fd-up) flowcontrol full
trunked lan0
e0b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:98:08:22:b6 (auto-unknown-cfg_down) flowcontrol full
e0c: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 02:a0:98:08:22:b7 (auto-1000t-fd-up) flowcontrol full
trunked lan0
e0d: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:98:08:22:b4 (auto-unknown-cfg_down) flowcontrol full
e2a: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:07:43:05:16:98 (auto-10g_sr-fd-cfg_down) flowcontrol full
e2b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:07:43:05:16:99 (auto-10g_sr-fd-cfg_down) flowcontrol full
lo: flags=19e8049<UP,LOOPBACK,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 8160
inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
ether 00:00:00:00:00:00 (VIA Provider)
lan0: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
inet 10.28.17.213 netmask 0xffffff00 broadcast 10.28.17.255
partner lan0 (not in use)
ether 02:a0:98:08:22:b7 (Enabled virtual interface)
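
For anyone who wants to check the takeover timing from a console capture like the one above, here is a minimal Python sketch. It is not part of Data ONTAP. The function name takeover_seconds, the regular expression, and the default year (taken from the date of this post) are illustrative assumptions. It looks only for the cf.fm.takeoverStarted and cf.fm.takeoverComplete events seen in the log and tolerates the line wrapping added by mail clients.

```python
import re
from datetime import datetime

# Matches e.g. "Mon May  5 12:27:48 EST [array01: cf.fm.takeoverStarted:warning]".
# \s+ and [^\]]* also match newlines, so wrapped console lines still parse.
EVENT_RE = re.compile(
    r"\w{3}\s+(\w{3})\s+(\d{1,2})\s+(\d\d:\d\d:\d\d)\s+\w+\s+"
    r"\[[^\]]*?(cf\.fm\.takeover(?:Started|Complete)):"
)


def takeover_seconds(log_text, year=2008):
    """Return seconds between takeover start and completion, or None.

    The console timestamps carry no year, so one is assumed (hypothetical
    default: 2008, the year of this post).
    """
    stamps = {}
    for mon, day, clock, event in EVENT_RE.findall(log_text):
        when = datetime.strptime(
            f"{year} {mon} {day} {clock}", "%Y %b %d %H:%M:%S"
        )
        # Keep the first occurrence of each event.
        stamps.setdefault(event, when)
    start = stamps.get("cf.fm.takeoverStarted")
    done = stamps.get("cf.fm.takeoverComplete")
    if start is None or done is None:
        return None
    return (done - start).total_seconds()


if __name__ == "__main__":
    sample = (
        "Mon May  5 12:27:48 EST [array01: cf.fm.takeoverStarted:warning]:\n"
        "Cluster monitor: takeover started\n"
        "Mon May  5 12:27:55 EST [array01 (takeover):\n"
        "cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed\n"
    )
    print(takeover_seconds(sample))
```

This only measures the interval the cluster monitor itself reports. It says nothing about how long clients take to reconnect, which is a separate measurement.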