Stephen;
An excellent understanding of how CF works.
We're looking forward to working with you on your implementation.
Regards
Lew Kirschner
-------------
At 11:08 AM 9/18/1998 -0400, Stephen C. Losen wrote:
The www.netapp.com web site has a white paper on NetApp's clustered failover (CF) design. We have two F630s that we plan to convert to CF. NetApp has not released CF yet, but we have read the white paper and had some of our questions answered by tech folks.
Clustered failover connects two filers to the same stack of disk shelves. Each filer has its own set of volumes built on its own set of disks, so during normal operation, the cluster behaves like (and has the performance of) two separate filers. If one filer fails, the healthy filer takes over the failed filer's volumes and starts serving them.
It looks like a failover will be about as disruptive as a reboot: NFS mounts over UDP will survive, but all CIFS connections to the failed filer will be lost. Presumably the healthy filer's own service will not be interrupted. Failover will take a few minutes, so to clients it will appear as if the failed filer simply rebooted.
Once the failed hardware is repaired, I'm pretty sure that going back to normal operation requires rebooting the healthy filer, because it has to "offline" the volumes that it took over and that requires a reboot.
So CF doesn't eliminate service disruptions. It just means that certain hardware failures become no more disruptive than one unscheduled reboot and one scheduled reboot. Plus you have poorer performance while one filer does the work of two.
Designing CF so that there are absolutely no service disruptions might very well require one filer to be in "standby" mode at all times and do no work during normal operations. Since hardware failures are rare, this is wasteful.
The key to clustered failover is that each filer uses only half of its own NVRAM and mirrors its NVRAM state onto the unused half of its partner's NVRAM. So in the case of a hardware failure, the healthy filer takes over the volumes of the failed filer, takes over the IP address(es) of the failed filer, and starts serving the volumes. The takeover procedure is similar to when a standalone filer recovers from an abrupt power outage. The filesystem resumes at the latest WAFL consistency point, and the NVRAM log is replayed to perform transactions that happened after the consistency point.
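To make the NVRAM mirroring concrete, here is a minimal Python sketch of how I picture it. This is only a toy model of the description above, not NetApp's implementation: the `Filer` class, `client_write`, `consistency_point`, and `takeover` are names I made up, and a plain dict stands in for the on-disk filesystem.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Filer:
    """Toy filer: hypothetical model, not a real NetApp interface."""
    name: str
    on_disk: dict = field(default_factory=dict)     # state as of the last consistency point
    local_log: list = field(default_factory=list)   # this filer's half of its NVRAM
    mirror_log: list = field(default_factory=list)  # copy of the partner's log (other half)
    partner: Optional["Filer"] = None


def pair(a: Filer, b: Filer) -> None:
    """Link two filers as cluster partners."""
    a.partner, b.partner = b, a


def client_write(filer: Filer, key: str, value: str) -> None:
    """Log a write locally and mirror it into the partner's NVRAM."""
    entry = (key, value)
    filer.local_log.append(entry)
    filer.partner.mirror_log.append(entry)


def consistency_point(filer: Filer) -> None:
    """Flush logged writes to disk, then discard both copies of the log."""
    for key, value in filer.local_log:
        filer.on_disk[key] = value
    filer.local_log.clear()
    filer.partner.mirror_log.clear()


def takeover(healthy: Filer, failed_on_disk: dict) -> dict:
    """Start from the failed filer's last consistency point (read from the
    shared disks) and replay the mirrored log held by the healthy filer."""
    recovered = dict(failed_on_disk)
    for key, value in healthy.mirror_log:
        recovered[key] = value
    return recovered


a, b = Filer("filerA"), Filer("filerB")
pair(a, b)
client_write(a, "file1", "v1")
consistency_point(a)             # file1 is now on disk
client_write(a, "file2", "v2")   # file2 exists only in NVRAM (both copies)
recovered = takeover(b, a.on_disk)  # filerA "fails"; filerB replays the mirror
```

In this model the write to `file2` survives the failure only because filerB holds a mirrored copy of the log entry, which is the point of splitting the NVRAM in half.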
Clustered failover requires fibre channel disks because you can't hook SCSI disk shelves up to multiple filers. Each fibre channel disk shelf in a CF system has two fibre channel interfaces, and each filer has its own daisy chain linking it to all the shelves. During normal operation each disk is "owned" by one filer. It appears that disks from the same shelf can be owned by different filers, so you can add a disk to any shelf and assign it to either filer. I don't know if hot spares can be left unassigned, so it may be necessary for each filer to have its own hot spare(s) assigned to it. In the event of a failover, the healthy filer takes over all the disks.
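The disk ownership idea can be sketched the same way. Again this is just my own illustration under stated assumptions: the shelf and filer names are invented, `take_over_disks` is a hypothetical helper, and I am assuming each hot spare is assigned to one filer, which I have not confirmed.

```python
def take_over_disks(owner_by_disk: dict, failed: str, healthy: str) -> dict:
    """Return a new ownership map in which every disk owned by the failed
    filer is reassigned to the healthy one. The input map is not modified."""
    return {
        disk: (healthy if owner == failed else owner)
        for disk, owner in owner_by_disk.items()
    }


# Keys are (shelf, slot); values are the owning filer. Disks in the same
# shelf may belong to different filers.
owners = {
    ("shelf1", 0): "filerA",
    ("shelf1", 1): "filerB",
    ("shelf1", 2): "filerA",  # assumed: a hot spare assigned to filerA
    ("shelf2", 0): "filerB",
}

after = take_over_disks(owners, failed="filerA", healthy="filerB")
```

After the call, every entry in `after` belongs to filerB, while `owners` still records the normal-operation assignment that would be restored once filerA is repaired.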
Netapp's CF supports only two filers. This makes sense since each disk shelf must have a fibre channel interface for each filer in the cluster. This puts a rather severe limit on the theoretical size of a cluster using this hardware.
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
NETWORK APPLIANCE
======================================================================
FAST! SIMPLE! RELIABLE! MULTIPROTOCOL!!
**********************************************************************
Lew Kirschner - Eastern Area    V-Mail: 732 603 7330
Reseller Mgr                    (V-Mail Only)
Network Appliance               Reach #: 914-369-3830
35 Sagamore Avenue              Fax #: 914-369-3832
Suffern, New York 10901         e-mail: lewisk@netapp.com
http://www.netapp.com
**********************************************************************
The Market Leader in Network Attached Storage for Multiprotocol Environments!