A couple of things I just gotta throw in here since Adaptec dumped Sun/Veritas HA for the future NetApp CF solution ;^)
1) Failover will take a few minutes indeed. There's just no way around that.

2) Clustered failover should be possible on FWD SCSI. That was the way Sun did it. You are limited by your implementation of drive attachment. Sun used a dual-initiator SCSI setup (now the StorEdge arrays) in which both hosts attached to all the disks. It works, but it's also really slooooowwww :^)

3) My understanding of HA and CF during failover is this:
* The failed server goes away (don't care how or why, it just does).
* The second server picks this up via the interconnect and begins a few key things, i.e. spoofs the network interfaces of the failed server, grabs its vols, etc.
Now, unless NetApp is implementing some very strange code (which I don't think they are doing <grin>), the backup NICs on the second server will pick up the failed server's hostname, IP, and MAC address. As far as I know that's the only way to implement it.

4) Reverting back to normal mode should be the reverse of failing over. It should only take a few minutes, and you should be able to do it live. Otherwise what's the point? If you have to schedule a reboot, why have CF? I know this is a simplistic view of the decision process, but if the failover isn't relatively transparent to my NFS and CIFS clients, I don't need CF. Anyway, back to the point: you should be able to signal the surviving machine that the failed machine is coming back online, have it let go of the spoofed NICs (basically ifconfig down) and volumes, and have the returning machine pick up any NVRAM updates for itself and continue chugging away.
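For what it's worth, the generic Unix takeover/giveback dance in (3) and (4) can be sketched like this. All the interface names, addresses, and the MAC below are made up for illustration; a real HA agent would run these commands against the OS, and whatever NetApp actually does lives inside Data ONTAP, not in a script like this:

```python
# Sketch of a generic HA takeover/giveback sequence (NOT NetApp's code).
# Interface name, IP, netmask, and MAC are hypothetical examples.

def takeover_cmds(iface, ip, netmask, mac):
    """Commands the surviving node would run to impersonate its dead partner."""
    return [
        # put the failed node's MAC on a backup NIC (Solaris/BSD-style syntax)
        f"ifconfig {iface} ether {mac}",
        # bring the interface up with the failed node's IP address
        f"ifconfig {iface} {ip} netmask {netmask} up",
        # gratuitous ARP so clients and switches learn where the IP moved
        f"arping -U -I {iface} {ip}",
    ]

def giveback_cmds(iface):
    """Reverse of takeover: drop the spoofed interface (basically ifconfig down)."""
    return [f"ifconfig {iface} down"]

cmds = takeover_cmds("hme1", "192.168.10.5", "255.255.255.0", "08:00:20:aa:bb:cc")
```

The gratuitous ARP step is why "not all your clients will reconnect" if MAC takeover isn't done: hosts with a stale ARP cache keep sending frames to the dead NIC until the entry times out.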
Failover and failback should be that easy. I'll admit that it's a pain to set up any HA or CF solution, but once you get it configured it's a breeze.
BTW, even a backup machine sitting idle will still take an apparent "reboot" to bring on-line, and not all your clients will reconnect automatically because of the MAC address change.
I hope this info helps a bit. Sorry for the soap box :^)
--Steve
Stephen C. Losen wrote:
At the www.netapp.com web site is a white paper on netapp's clustered failover (CF) design. We have two F630s that we plan to convert to CF. Netapp has not released CF yet, but we have read the white paper and had some of our questions answered by tech folks.
Clustered failover connects two filers to the same stack of disk shelves. Each filer has its own set of volumes built on its own set of disks, so during normal operation, the cluster behaves like (and has the performance of) two separate filers. If one filer fails, the healthy filer takes over the failed filer's volumes and starts serving them.
It looks like a failover will be about as disruptive as a reboot, i.e., NFS mounts via udp will survive, but all CIFS connections will be lost (on the failed filer). Presumably the healthy filer will not have its service interrupted. Failover will take a few minutes, so it will appear as if the failed filer simply rebooted.
Once the failed hardware is repaired, I'm pretty sure that going back to normal operation requires rebooting the healthy filer, because it has to "offline" the volumes that it took over and that requires a reboot.
So CF doesn't eliminate service disruptions. It just means that certain hardware failures become no more disruptive than one unscheduled reboot and one scheduled reboot. Plus you have poorer performance while one filer does the work of two.
Designing CF so that there are absolutely no service disruptions might very well require one filer to be in "standby" mode at all times and do no work during normal operations. Since hardware failures are rare, this is wasteful.
The key to clustered failover is that each filer uses only half of its own NVRAM and mirrors its NVRAM state onto the unused half of its partner's NVRAM. So in the case of a hardware failure, the healthy filer takes over the volumes of the failed filer, takes over the IP address(es) of the failed filer, and starts serving the volumes. The takeover procedure is similar to when a standalone filer recovers from an abrupt power outage. The filesystem resumes at the latest WAFL consistency point, and the NVRAM log is replayed to perform transactions that happened after the consistency point.
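Here's a toy model (my own, not NetApp code) of why takeover looks like crash recovery: the on-disk state is the last consistency point, and the survivor replays the mirrored NVRAM log to redo any transactions acknowledged after that point.

```python
# Toy simulation of NVRAM logging, consistency points, and partner replay.
# "disk", "nvram_log", and the class itself are illustrative inventions.

class Filer:
    def __init__(self):
        self.disk = {}        # state as of the last consistency point
        self.nvram_log = []   # ops logged (and mirrored to the partner) since then

    def write(self, key, value):
        self.nvram_log.append((key, value))   # acknowledged once it's in NVRAM

    def consistency_point(self):
        for key, value in self.nvram_log:     # flush the log to disk
            self.disk[key] = value
        self.nvram_log = []

def take_over(mirrored_log, partner_disk):
    """Partner died after its last CP; replay its mirrored NVRAM half."""
    recovered = dict(partner_disk)
    for key, value in mirrored_log:
        recovered[key] = value
    return recovered

a = Filer()
a.write("f1", "v1")
a.consistency_point()              # f1 reaches disk
a.write("f2", "v2")                # logged but NOT on disk when A "dies"
mirrored_log = list(a.nvram_log)   # CF keeps this copy in B's NVRAM half
state = take_over(mirrored_log, a.disk)
# state now contains both f1 and f2: no acknowledged write is lost
```

The point of the split NVRAM is exactly that `mirrored_log` already sits in the healthy filer's memory, so it can replay without touching the dead partner.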
Clustered failover requires fibre channel disks because you can't hook SCSI disk shelves up to multiple filers. Each fibre channel disk shelf in a CF system has two fibre channel interfaces, and each filer has its own daisy chain linking it to all the shelves. During normal operation each disk is "owned" by one filer. It appears that disks from the same shelf can be owned by different filers, so you can add a disk to any shelf and assign it to either filer. I don't know if hot spares can be left unassigned, so it may be necessary for each filer to have its own hot spare(s) assigned to it. In the event of a failover, the healthy filer takes over all the disks.
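As I read the white paper, the ownership model above amounts to a per-disk map where shelves are shared but each disk has one owner, and failover just reassigns the map wholesale. A minimal sketch, with made-up disk and filer names:

```python
# Per-disk ownership in a two-filer cluster (names are hypothetical).
# Disks on the same shelf can be owned by different filers.

ownership = {
    "shelf1.disk0": "filerA", "shelf1.disk1": "filerA",
    "shelf2.disk0": "filerB", "shelf2.disk1": "filerA",  # mixed shelf
    "shelf2.disk2": "filerB",
}

def fail_over(ownership, failed, survivor):
    """On failover, the healthy filer takes over ALL of the failed filer's disks."""
    return {d: (survivor if o == failed else o) for d, o in ownership.items()}

after = fail_over(ownership, "filerB", "filerA")
# every disk in 'after' is now owned by filerA
```

This also shows why the scheme caps out at two filers: ownership moves as one block to the single partner, and every shelf needs a fibre channel interface per cluster member.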
Netapp's CF supports only two filers. This makes sense since each disk shelf must have a fibre channel interface for each filer in the cluster. This puts a rather severe limit on the theoretical size of a cluster using this hardware.
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
____________________________________________________
Engineering Systems and Network Administrator
phone: 408 957 2351 fax: 408 957 4895 email: nevets@eng.adaptec.com
Adaptec Milpitas, CA, USA ____________________________________________________