-- Adam Fox
adamfox@netapp.com
How long should it take a filer to fail over from one head to the other? When I
force a takeover (cf forcetakeover) from one head, the other goes down for
minutes. Here's what I see on the console. This is a new filer with very little
traffic going to it, and FC is not even set up yet; it is all NFS/CIFS.
array01> cf forcetakeover
cf forcetakeover may lead to data corruption; really force a takeover? yes
cf: forcetakeover initiated by operator
array01> Mon May 5 12:27:48 EST [array01: cf.misc.operatorForcedTakeover:warning]: Cluster monitor: forced takeover initiated by operator
Mon May 5 12:27:48 EST [array01: cf.fsm.takeover.forced:info]: Cluster monitor: takeover attempted after cf forcetakeover command
Mon May 5 12:27:48 EST [array01: cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Mon May 5 12:27:48 EST [array01: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Mon May 5 12:27:48 EST [array02/array01: coredump.spare.none:info]: No sparecore disk was found.
Mon May 5 12:27:51 EST [array01: raid.vol.replay.nvram:info]: Performing raid replay on volume(s)
Mon May 5 12:27:51 EST [array01: raid.cksum.replay.summary:info]: Replayed 0 checksum blocks.
Mon May 5 12:27:51 EST [array01: raid.stripe.replay.summary:info]: Replayed 0 stripes.
Mon May 5 12:27:54 EST [array02/array01: wafl.replay.done:info]: WAFL log replay completed, 2 seconds
ifconfig: no such media type <xxx>
media type options are: <tp> <tp-fd> <100tx> <100tx-fd> <1000fx> <auto> <10g-sr>
ifconfig: Unable to determine primary for interface e0a.
ifconfig: e0a: no such interface
ifconfig: Unable to determine primary for interface e0b.
ifconfig: e0b: no such interface
ifconfig: Unable to determine primary for interface e0c.
ifconfig: e0c: no such interface
ifconfig: Unable to determine primary for interface e0d.
ifconfig: e0d: no such interface
ifconfig: Unable to determine primary for interface e2a.
ifconfig: e2a: no such interface
ifconfig: Unable to determine primary for interface e2b.
ifconfig: e2b: no such interface
add net default: gateway 10.28.17.1: network unreachable
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0a.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0b.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0c.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e0d.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e2a.
Mon May 5 12:27:55 EST [array02/array01: net.ifconfig.noLocal:error]: ifconfig: Unable to determine primary for interface e2b.
Mon May 5 12:27:55 EST [array02/array01: nis.servers.not.available:error]:
Mon May 5 12:27:55 EST [array02/array01: cf_takeover:info]: relog syslog Mon May 5 12:26:00 EST [array02: monitor.globalStatus.ok:info]: The system's global status is normal.
Mon May 5 12:27:55 EST [array02/array01: cf_takeover:info]: relog syslog Mon May 5 12:27:47 EST [array02: cf.fsm.takeoverOfPartnerDisabled:notice]: Cluster monitor: takeover of array
There are 68 spare disks; you may want to use the vol or aggr command
to create new volumes or aggregates or add disks to the existing aggregate.
FCP service stopped.
Mon May 5 12:27:55 EST [array01: net.ifconfig.takeoverError:warning]: WARNING: 6 errors detected during network takeover processing WARNING: Some network clients may not be able to access the cluster during takeover
Mon May 5 12:27:55 EST [array01: cf.rsrc.takeoverOpFail:error]: Cluster monitor: takeover during ifconfig_2 failed; takeover continuing...
CIFS partner server is running.
Mon May 5 12:27:55 EST [array01 (takeover): cf.rsrc.transitTime:notice]: Top Takeover transit times wafl_replay=2383 {replay_log=2353, mark_replaying=29}, raid=832, rc=410 {hostname=51, ifconfig=46, options=23, options=14, options=10, options=9, ifconfig=1, ifconfig=1, ifconfig=1, route=1}, wafl=405, registry_postrc_phase1=227, raid_replay=179, registry_prerc=115, wafl_sync=74, fmdisk_reserve=70, cifs=70
Mon May 5 12:27:55 EST [array01 (takeover): cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Mon May 5 12:27:55 EST [array01 (takeover): cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time is 7 seconds
Mon May 5 12:27:58 EST [array02/array01: asup.smtp.host:info]: Autosupport cannot connect to host smtp.danahermail.com (Network comm problem) for message: REBOOT (CLUSTER TAKEOVER)
Mon May 5 12:27:58 EST [array02/array01: asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts from the autosupport.mailhost option. (REBOOT (CLUSTER TAKEOVER))
Mon May 5 12:28:00 EST [array01 (takeover): monitor.globalStatus.critical:CRITICAL]: This node has taken over array02.
Mon May 5 12:28:00 EST [array02/array01: monitor.globalStatus.critical:CRITICAL]: array01 has taken over this node.
Mon May 5 12:28:05 EST [array02/array01: nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations have completed for the partner server.
Mon May 5 12:28:07 EST [array01 (takeover): asup.post.sent:notice]: Cluster Notification message posted to IBM: Cluster Notification from array01 (CLUSTER TAKEOVER COMPLETE MANUAL) INFO
Mon May 5 12:32:07 EST [array02/array01: asup.smtp.host:info]: Autosupport cannot connect to host smtp.danahermail.com (Network comm problem) for message: REBOOT (CLUSTER TAKEOVER)
Mon May 5 12:32:07 EST [array02/array01: asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts from the autosupport.mailhost option. (REBOOT (CLUSTER TAKEOVER))
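
For anyone who wants to see where the takeover time went, the cf.rsrc.transitTime message in the console output above can be broken down with a few lines of Python. This is only a rough sketch: the string is copied from that log line, the helper name top_level_phases is made up for illustration, and I am assuming the values are milliseconds, which the log does not state. Entries inside {...} are treated as sub-steps of the phase before them and left out of the total.

```python
import re

# Message text copied from the cf.rsrc.transitTime console line.
TRANSIT = (
    "wafl_replay=2383 {replay_log=2353, mark_replaying=29}, raid=832, rc=410 "
    "{hostname=51, ifconfig=46, options=23, options=14, options=10, options=9, "
    "ifconfig=1, ifconfig=1, ifconfig=1, route=1}, wafl=405, "
    "registry_postrc_phase1=227, raid_replay=179, registry_prerc=115, "
    "wafl_sync=74, fmdisk_reserve=70, cifs=70"
)


def top_level_phases(text):
    """Return (name, value) pairs for top-level phases, skipping {...} sub-steps."""
    # Drop the brace groups so sub-steps are not counted twice.
    without_substeps = re.sub(r"\{[^}]*\}", "", text)
    return [(name, int(value))
            for name, value in re.findall(r"(\w+)=(\d+)", without_substeps)]


phases = top_level_phases(TRANSIT)
total = sum(value for _, value in phases)

# Print phases from slowest to fastest, then the sum.
# Unit is assumed to be milliseconds (not confirmed by the log).
for name, value in sorted(phases, key=lambda p: p[1], reverse=True):
    print(f"{name:24s} {value:6d}")
print(f"{'total':24s} {total:6d}")
```

If the millisecond assumption holds, the listed phases add up to a little under 5 seconds, which is in line with the 7 second takeover duration the cluster monitor reports, so the minutes of outage do not show up in these numbers.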
Here is the ifconfig -a from the node that stayed up:
array01(takeover)> ifconfig -a
e0a: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
        ether 02:a0:98:08:22:b7 (auto-1000t-fd-up) flowcontrol full
        trunked lan0
e0b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
        ether 00:a0:98:08:22:b6 (auto-unknown-cfg_down) flowcontrol full
e0c: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
        ether 02:a0:98:08:22:b7 (auto-1000t-fd-up) flowcontrol full
        trunked lan0
e0d: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
        ether 00:a0:98:08:22:b4 (auto-unknown-cfg_down) flowcontrol full
e2a: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
        ether 00:07:43:05:16:98 (auto-10g_sr-fd-cfg_down) flowcontrol full
e2b: flags=108042<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
        ether 00:07:43:05:16:99 (auto-10g_sr-fd-cfg_down) flowcontrol full
lo: flags=19e8049<UP,LOOPBACK,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 8160
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
        ether 00:00:00:00:00:00 (VIA Provider)
lan0: flags=948043<UP,BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
        inet 10.28.17.213 netmask 0xffffff00 broadcast 10.28.17.255
        partner lan0 (not in use)
        ether 02:a0:98:08:22:b7 (Enabled virtual interface)