We've always had cluster failover on our filers for those times when something goes wrong on one filer and the other filer can serve the data. Realistically speaking, that rarely happens because the filers are stable. However, as time passes and the filers grow older, they develop more and more "personality."
For example, we have a pair of F760s that are our problem children. As much as we'd like to pawn them off to another group, put them out to pasture, or replace them outright, that does not appear to be happening in the near future. Unfortunately, one of them has now developed a problem with a memory DIMM on the motherboard. In the past, we've had the luxury of being able to shut down the clustered pair of filers, but in our price-conscious environment, people are asking what the point of clustering is if we can't do maintenance and keep the data available at the same time.
So, my question is this: is it possible to work on a filer head while serving the data up from the cluster partner? My concern is that you are upsetting the FC-AL integrity because we'll have to unplug the FC-AL cables from the adapters on the head when we pull the motherboard tray out. Then, since the recommended course of action is to run diagnostics after reseating and/or replacing the DIMMs, could we run a small set of the diagnostics before plugging in the FC-AL cables? Maybe we could / should use the FC-AL reset function from the diagnostics menu to get the loops back to normal?
Maybe we've just been too cautious with our data, but I'd like to hear from other toasters if this is possible, advisable, and safe before putting our data at risk.
Thanks,
Geoff
Geoff Hardin wrote:
So, my question is this: is it possible to work on a filer head while serving the data up from the cluster partner? My concern is that you are upsetting the FC-AL integrity because we'll have to unplug the FC-AL cables from the adapters on the head when we pull the motherboard tray out. Then, since the recommended course of action is to run diagnostics after reseating and/or replacing the DIMMs, could we run a small set of the diagnostics before plugging in the FC-AL cables? Maybe we could / should use the FC-AL reset function from the diagnostics menu to get the loops back to normal?
sure is. in fact, I use failover far more for maintenance and upgrades than for protection from a head failing ;-)
once you do a failover and power off the "failed" head, pulling the FC cables should have no effect, since those loops are unused at that point.
leave the FC cables pulled, and run just the memory diagnostics.
Maybe we've just been too cautious with our data, but I'd like to hear from other toasters if this is possible, advisable, and safe before putting our data at risk.
YMMV, but we've never had a problem working on a filer head while its partner served all the data for the cluster.
one caution: make sure your physical installation is such that you can do the mechanical work you need (open the mobo tray, pull cables, etc) without affecting the other filer or the power to the disk shelves.
-skottie
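The sequence described above can be written down as an ordered checklist. This is only a sketch in Python that prints the steps; it does not talk to a filer. The step wording is paraphrased from the posts, and the last two steps (reconnecting the cables and giving back) are not spelled out in the thread, so they are marked as assumptions. The names `Step`, `STEPS`, and `format_checklist` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    head: str          # "partner" = head that keeps serving, "failed" = head being serviced
    action: str
    from_thread: bool  # False = assumption, not stated in the thread


# Hypothetical checklist paraphrasing the thread; adjust to your own site procedures.
STEPS = [
    Step("failed", "check that the tray and cables can be worked on without touching the partner or shelf power", True),
    Step("partner", "take over the head that needs service (failover)", True),
    Step("failed", "power off the head", True),
    Step("failed", "unplug the FC-AL cables and pull the motherboard tray", True),
    Step("failed", "reseat or replace the DIMM", True),
    Step("failed", "run only the memory diagnostics with the FC-AL cables still unplugged", True),
    Step("failed", "reconnect the FC-AL cables", False),
    Step("partner", "give back to the serviced head", False),
]


def format_checklist(steps):
    """Return the steps as numbered lines, flagging assumed steps."""
    lines = []
    for number, step in enumerate(steps, start=1):
        note = "" if step.from_thread else " (assumed)"
        lines.append(f"{number}. [{step.head}] {step.action}{note}")
    return lines


if __name__ == "__main__":
    print("\n".join(format_checklist(STEPS)))
```

The point of keeping the diagnostics step ahead of the cable reconnect is the one made in the reply: the loops are unused by the powered-off head, so the memory test can run without touching them.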
At 2/23/2004 03:08 PM, Skottie Miller wrote:
sure is. in fact, I use failover far more for maintenance and upgrades
Can you upgrade one side and bring it up running a different DOT version from its partner? And then have it take over the partner, then down the partner to upgrade it?
Thanks,
Kerry
================================================================ Margaret.K.Herschel@jpl.nasa.gov JPL Information Services Jet Propulsion Laboratory Phone 818.354.1111
DISCLAIMER: The personal and professional opinions presented herein are my own and do not, in any way, represent the opinion or policy of JPL.
Kerry Herschel wrote:
Can you upgrade one side and bring it up running a different DOT version from its partner? And then have it take over the partner, then down the partner to upgrade it?
I always upgrade DOT to the same version on both heads of a cluster at the same time, and reboot them both at the same time. depending on your config, a DOT upgrade causes a 90 - 180 second outage for the reboot.
-skottie
On the hardware side, I can vouch for the others. There's no problem working on the "failed" head while serving data from the other. We've done it many times on our old hardware and plan to do it on our new F940 cluster when we need to.
We've also done rolling upgrades of DOT as mentioned below on our old F760 cluster. Upgrade one head, fail over, reboot that head, giveback, upgrade the other head, fail over, reboot the second head, and giveback. The first head will complain that it's not at the same DOT level as its partner, but we've never had a problem with data integrity. This also only happens for a short period between the first giveback and the second failover. I don't know if this is supported or not so as always, test if you can.
One note: we've never done major version upgrades (e.g. 5.x to 6.x) this way. IIRC we've only done minor version upgrades this way (e.g. 6.4.1 to 6.4.2 or 6.4.1P2 to 6.4.2).
Jeff
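To make the ordering of the rolling upgrade easier to follow, here is a small Python simulation of the eight steps listed above. It is a sketch only: the heads are called "A" and "B", the function names are invented, and "mixed" simply means both heads are serving while running different DOT versions. Nothing here is a NetApp tool or a statement about what is supported.

```python
def is_minor_upgrade(old, new):
    """Assumed rule of thumb: same leading number means a minor upgrade."""
    return old.split(".")[0] == new.split(".")[0]


def rolling_upgrade(old, new):
    """Walk through the sequence and record after which steps the cluster is mixed."""
    running = {"A": old, "B": old}     # version each head is currently running
    installed = {"A": old, "B": old}   # version that will run after the next reboot
    serving = {"A": True, "B": True}   # False while the partner has taken over
    log = []

    def record(label):
        mixed = serving["A"] and serving["B"] and running["A"] != running["B"]
        log.append((label, mixed))

    for head, partner in (("A", "B"), ("B", "A")):
        installed[head] = new
        record(f"upgrade {head}")
        serving[head] = False
        record(f"{partner} takes over {head}")
        running[head] = installed[head]
        record(f"reboot {head}")
        serving[head] = True
        record(f"giveback to {head}")
    return log


if __name__ == "__main__":
    for label, mixed in rolling_upgrade("6.4.1", "6.4.2"):
        print(f"{label:22} mixed versions: {mixed}")
```

In this model the only steps that leave both heads serving on different versions are the first giveback and the upgrade of the second head, which matches the short window described above.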
------------------------------------------------------------------------- "Allow me to extol the virtues of the Net Fairy, and of all the fantastic dorks that make the nice packets go from here to there. Amen." TB - Penny Arcade -------------------------------------------------------------------------
Scott Miller skottie@anim.dreamworks.com Sent by: owner-toasters@mathworks.com 02/24/2004 03:08 PM
To Kerry Herschel margaret.k.herschel@jpl.nasa.gov cc Geoff Hardin geoff.hardin@dalsemi.com, toasters toasters@mathworks.com Subject Re: Cluster failover question
Kerry Herschel wrote:
Can you upgrade one side and bring it up running a different DOT version from its partner? And then have it take over the partner, then down the partner to upgrade it?
I always upgrade DOT to the same version on both heads of a cluster at the same time, and reboot them both at the same time. depending on your config, a DOT upgrade causes a 90 - 180 second outage for the reboot.
-skottie