RE: Cluster failover question - toasters

23 Feb 2004


      The procedure below has worked many times on our F740C nodes, and the
data on both nodes is accessible the entire time (well, outside of the
takeover and giveback):
- Manually takeover the problem filer from the working filer. (cf takeover)
- Power off and unplug everything (cluster cables, FCAL cables, everything)
   from the problem filer head (only the head, not the disk shelves).
- Replace whatever needs replacing.
- Connect everything except the cluster connection.  I can't remember if I
   left the FCAL cable to the partner's disks disconnected as well, but I
   think I did just to be cautious. I do know I keep the FCAL connection to
   its own disks connected.
- Boot the problem filer with the diagnostic disk and run all diagnostics.
   As long as you shut down cleanly you can do NVRAM tests, and as long as
   the cluster is not connected you can run all the MB tests. You can run
   the memory tests with everything connected.
- After the diagnostics have run, take out the diag disk and reboot the filer. It
   should stop at the "waiting for giveback..." statement.
- Do a cf giveback on the working filer.
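
For anyone who wants to keep the steps above as a checklist, here is a
minimal Python sketch. It only records the order of the steps and which head
each one is done on; it does not talk to a filer. The names Step, PROCEDURE,
next_step and render are made up for illustration, and the only real
commands in it are the cf takeover and cf giveback already mentioned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    where: str   # which head the step is done on: "working" or "problem"
    action: str  # what to do, paraphrased from the list above


# Hypothetical checklist; order follows the steps described in this message.
PROCEDURE = [
    Step("working", "cf takeover"),
    Step("problem", "power off and unplug everything from the head (not the disk shelves)"),
    Step("problem", "replace whatever needs replacing"),
    Step("problem", "connect everything except the cluster connection"),
    Step("problem", "boot with the diagnostic disk and run diagnostics"),
    Step("problem", "remove the diag disk and reboot; expect 'waiting for giveback...'"),
    Step("working", "cf giveback"),
]


def next_step(completed: int) -> Optional[Step]:
    """Return the step that follows `completed` finished steps, or None when done."""
    if completed < 0:
        raise ValueError("completed must be >= 0")
    return PROCEDURE[completed] if completed < len(PROCEDURE) else None


def render() -> str:
    """Format the checklist as numbered lines, one per step."""
    return "\n".join(
        f"{i}. [{s.where} filer] {s.action}" for i, s in enumerate(PROCEDURE, 1)
    )


if __name__ == "__main__":
    print(render())
```

Running it prints the numbered checklist; next_step(n) gives the step to do
once n steps are finished, which is handy if two people are working the
heads and one is reading the steps out.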
I've done this about 10 times on an F740C and F880C and have never
had a problem with data not being available, or becoming corrupted.
Just for the record, data center power outages cause most of these
failures, not the NetApps.
Jeff
...
From: Geoff Hardin geoff.hardin@dalsemi.com
To: toasters toasters@mathworks.com
Subject: Cluster failover question
Date: Mon, 23 Feb 2004 16:04:49 -0600
We've always had cluster failover on our filers for those times when 
something goes wrong on one filer and the other filer can serve the data.  
Realistically speaking, that rarely happens because the filers are stable.  
However, as time passes and the filers grow older, they develop more and 
more "personality."
For example, we have a pair of F760s that are our problem children.  As 
much as we'd like to pawn them off to another group, put them out to 
pasture, or replace them outright, that does not appear to be happening in 
the near future.  Unfortunately, one of them has now developed a problem 
with a memory DIMM on the motherboard.  In the past, we've had the luxury 
of being able to shut down the clustered pair of filers, but in our
price-conscious environment, people are asking what is the point of clustering if 
we can't do maintenance and keep the data available.
So, my question is this:  is it possible to work on a filer head while 
serving the data up from the cluster partner?  My concern is that you are 
upsetting the FC-AL integrity because we'll have to unplug the FC-AL cables 
from the adapters on the head when we pull the motherboard tray out.  Then, 
since the recommended course of action is to run diagnostics after 
reseating and/or replacing the DIMMs, could we run a small set of the 
diagnostics before plugging in the FC-AL cables?  Maybe we could / should 
use the FC-AL reset function from the diagnostics menu to get the loops 
back to normal?
Maybe we've just been too cautious with our data, but I'd like to hear from 
other toasters if this is possible, advisable, and safe before putting our 
data at risk.
Thanks,
Geoff
--
Geoff Hardin
geoff.hardin@dalsemi.com
Put on your seatbelt. I wanna try something.