A couple of things I just gotta throw in here since Adaptec dumped Sun/Veritas HA for the future NetApp CF solution ;^)
1) Failover will take a few minutes indeed. There's just no way around that.

2) Clustered failover should be possible on FWD SCSI. That was the way Sun did it. You are limited by your implementation of drive attachment. Sun used a dual-initiator SCSI setup (now the StorEdge arrays) in which both hosts attached to all the disks. It works, but it's also really slooooowwww :^)

3) My understanding of HA and CF during failover is this:
* The failed server goes away (don't care how or why, it just does).
* The second server picks this up via the interconnect and begins a few key things, i.e. spoofs the network interfaces of the failed server, grabs its vols, etc.
Now, unless NetApp is implementing some very strange code (which I don't think they are doing <grin>), the backup NICs on the second server will pick up the failed server's hostname, IP, and MAC address. As far as I know that's the only way to implement it.

4) Reverting back to normal mode should be the reverse of failing over. It should only take a few minutes, and you should be able to do it live. Otherwise what's the point? If you have to schedule a reboot, why have CF? I know this is a simplistic view of the decision process, but if the failover isn't relatively transparent to my NFS and CIFS clients, I don't need CF. Anyway, back to the point: you should be able to signal the surviving machine that the failed machine is coming back online, have it let go of the spoofed NICs (basically ifconfig down) and volumes, and have the returning machine pick up any NVRAM updates for itself and continue chugging away.
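For what it's worth, the generic Unix takeover/giveback dance in (3) and (4) can be sketched like this. All the interface names, addresses, and the MAC below are made up for illustration; a real HA agent would run these commands against the OS, and whatever NetApp actually does lives inside Data ONTAP, not in a script like this:

```python
# Sketch of a generic HA takeover/giveback sequence (NOT NetApp's code).
# Interface name, IP, netmask, and MAC are hypothetical examples.

def takeover_cmds(iface, ip, netmask, mac):
    """Commands the surviving node would run to impersonate its dead partner."""
    return [
        # put the failed node's MAC on a backup NIC (Solaris/BSD-style syntax)
        f"ifconfig {iface} ether {mac}",
        # bring the interface up with the failed node's IP address
        f"ifconfig {iface} {ip} netmask {netmask} up",
        # gratuitous ARP so clients and switches learn where the IP moved
        f"arping -U -I {iface} {ip}",
    ]

def giveback_cmds(iface):
    """Reverse of takeover: drop the spoofed interface (basically ifconfig down)."""
    return [f"ifconfig {iface} down"]

cmds = takeover_cmds("hme1", "192.168.10.5", "255.255.255.0", "08:00:20:aa:bb:cc")
```

The gratuitous ARP step is why "not all your clients will reconnect" if MAC takeover isn't done: hosts with a stale ARP cache keep sending frames to the dead NIC until the entry times out.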
Failover and failback should be that easy. I'll admit that it's a pain to set up any HA or CF solution, but once you get it configured it's a breeze.
BTW, even a backup machine sitting idle will still take an apparent "reboot" to bring on-line, and not all your clients will reconnect automatically because of the MAC address change.
I hope this info helps a bit. Sorry for the soap box :^)
--Steve
Stephen C. Losen wrote:
At the www.netapp.com web site is a white paper on netapp's clustered failover (CF) design. We have two F630s that we plan to convert to CF. Netapp has not released CF yet, but we have read the white paper and had some of our questions answered by tech folks.
Clustered failover connects two filers to the same stack of disk shelves. Each filer has its own set of volumes built on its own set of disks, so during normal operation, the cluster behaves like (and has the performance of) two separate filers. If one filer fails, the healthy filer takes over the failed filer's volumes and starts serving them.
It looks like a failover will be about as disruptive as a reboot, i.e., NFS mounts via udp will survive, but all CIFS connections will be lost (on the failed filer). Presumably the healthy filer will not have its service interrupted. Failover will take a few minutes, so it will appear as if the failed filer simply rebooted.
Once the failed hardware is repaired, I'm pretty sure that going back to normal operation requires rebooting the healthy filer, because it has to "offline" the volumes that it took over and that requires a reboot.
So CF doesn't eliminate service disruptions. It just means that certain hardware failures become no more disruptive than one unscheduled reboot and one scheduled reboot. Plus you have poorer performance while one filer does the work of two.
Designing CF so that there are absolutely no service disruptions might very well require one filer to be in "standby" mode at all times and do no work during normal operations. Since hardware failures are rare, this is wasteful.
The key to clustered failover is that each filer uses only half of its own NVRAM and mirrors its NVRAM state onto the unused half of its partner's NVRAM. So in the case of a hardware failure, the healthy filer takes over the volumes of the failed filer, takes over the IP address(es) of the failed filer, and starts serving the volumes. The takeover procedure is similar to when a standalone filer recovers from an abrupt power outage. The filesystem resumes at the latest WAFL consistency point, and the NVRAM log is replayed to perform transactions that happened after the consistency point.
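Here's a toy model (my own, not NetApp code) of why takeover looks like crash recovery: the on-disk state is the last consistency point, and the survivor replays the mirrored NVRAM log to redo any transactions acknowledged after that point.

```python
# Toy simulation of NVRAM logging, consistency points, and partner replay.
# "disk", "nvram_log", and the class itself are illustrative inventions.

class Filer:
    def __init__(self):
        self.disk = {}        # state as of the last consistency point
        self.nvram_log = []   # ops logged (and mirrored to the partner) since then

    def write(self, key, value):
        self.nvram_log.append((key, value))   # acknowledged once it's in NVRAM

    def consistency_point(self):
        for key, value in self.nvram_log:     # flush the log to disk
            self.disk[key] = value
        self.nvram_log = []

def take_over(mirrored_log, partner_disk):
    """Partner died after its last CP; replay its mirrored NVRAM half."""
    recovered = dict(partner_disk)
    for key, value in mirrored_log:
        recovered[key] = value
    return recovered

a = Filer()
a.write("f1", "v1")
a.consistency_point()              # f1 reaches disk
a.write("f2", "v2")                # logged but NOT on disk when A "dies"
mirrored_log = list(a.nvram_log)   # CF keeps this copy in B's NVRAM half
state = take_over(mirrored_log, a.disk)
# state now contains both f1 and f2: no acknowledged write is lost
```

The point of the split NVRAM is exactly that `mirrored_log` already sits in the healthy filer's memory, so it can replay without touching the dead partner.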
Clustered failover requires fibre channel disks because you can't hook SCSI disk shelves up to multiple filers. Each fibre channel disk shelf in a CF system has two fibre channel interfaces, and each filer has its own daisy chain linking it to all the shelves. During normal operation each disk is "owned" by one filer. It appears that disks from the same shelf can be owned by different filers, so you can add a disk to any shelf and assign it to either filer. I don't know if hot spares can be left unassigned, so it may be necessary for each filer to have its own hot spare(s) assigned to it. In the event of a failover, the healthy filer takes over all the disks.
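As I read the white paper, the ownership model above amounts to a per-disk map where shelves are shared but each disk has one owner, and failover just reassigns the map wholesale. A minimal sketch, with made-up disk and filer names:

```python
# Per-disk ownership in a two-filer cluster (names are hypothetical).
# Disks on the same shelf can be owned by different filers.

ownership = {
    "shelf1.disk0": "filerA", "shelf1.disk1": "filerA",
    "shelf2.disk0": "filerB", "shelf2.disk1": "filerA",  # mixed shelf
    "shelf2.disk2": "filerB",
}

def fail_over(ownership, failed, survivor):
    """On failover, the healthy filer takes over ALL of the failed filer's disks."""
    return {d: (survivor if o == failed else o) for d, o in ownership.items()}

after = fail_over(ownership, "filerB", "filerA")
# every disk in 'after' is now owned by filerA
```

This also shows why the scheme caps out at two filers: ownership moves as one block to the single partner, and every shelf needs a fibre channel interface per cluster member.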
Netapp's CF supports only two filers. This makes sense since each disk shelf must have a fibre channel interface for each filer in the cluster. This puts a rather severe limit on the theoretical size of a cluster using this hardware.
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
____________________________________________________
Engineering Systems and Network Administrator
phone: 408 957 2351 fax: 408 957 4895 email: nevets@eng.adaptec.com
Adaptec Milpitas, CA, USA ____________________________________________________