head swap of Fabric Metro-Cluster?


Jan-Pieter Cornet

15 Apr 2013

12:55 p.m.

Hi,

Does anyone have experience doing a head swap of a FMC, from 6080 heads to 6290 heads? And hints beyond what's already in the "Upgrading a FAS60xx system in an HA pair to a FAS62xx system in an HA pair" document?

Is there any reason why you can't do this, during a HA pair upgrade, to limit service downtime:

- doing a failover of node 1 to node 2.
- shut down and replace the hardware of node 1.
- shutting down node 2.
- bring up the now-replaced node 1, reassign disks, and takeover node 2 on the new node 1.
- replacing the node 2 hardware
- doing a giveback on node 1 to node 2.

-- Jan-Pieter Cornet "Most seasonal greetings are sent by spammers and phishers."


tmac

15 Apr

1:06 p.m.

totally unsupported.

For starters, the NVRAM is different in these models. Still, head swaps, as far as I know, are not supported during takeover/giveback.... Environment variables are set that may be unique also. Been a while since I have checked.

You need to be careful with disk assignments (software ownership). Networking interfaces may not line up and will require correction.

Basically, don't do it. Don't try it. Follow the upgrade guide.

--tmac

*Tim McCarthy* *Principal Consultant*

Clustered ONTAP Clustered ONTAP NCDA ID: XK7R3GEKC1QQ2LVD RHCE5 805007643429572 NCSIE ID: C14QPHE21FR4YWD4 Expires: 08 November 2014 Expires w/release of RHEL7 Expires: 08 November 2014


Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters

Jan-Pieter Cornet

2:14 p.m.

On 2013-4-15 15:06, tmac wrote:

> totally unsupported.
>
> For starters, the NVRAM is different in these models. Still, head swaps, as far as I know, are not supported during takeover/giveback....

I don't think you understand my intention. What I plan to do is:

- takeover node 1 (fas 6080-1) by node 2 (fas 6080-2). This is a regular, supported operation.
- power down node 1 (fas 6080-1). Shouldn't have any impact, because node 2 (fas 6080-2) is taking over the service.
- remove node 1 (fas 6080-1) from rack, put replacement node 1 (fas 6290-1) in rack, NOT powered on.
- Now, we shut down node 2 (fas 6080-2). At this point there is a service interruption. Proceed to power down node 2 (fas 6080-2).

Then we bring up the new node 1 (fas 6290-1), standalone (no connection to partner). Reassign disks, make sure interfaces are aligned and OS versions are aligned etc. Then reboot node 1. This should bring back services that are on node 1, now on new hardware (except that it's not HA yet, because the partner hardware isn't there).

If possible, we would now like to do a "takeover" of the as-yet non-existent node 2 on the new node 1, so services that are configured on node 2 will be available again.

We then proceed to remove the old node 2 (fas 6080-2) from the rack, which is off anyway, replace it by the new node 2 (fas 6290-2), make sure cables are properly connected, then boot it and do a giveback on node 1.

I am aware that you cannot replace one node during a takeover, connect the NVRAM cards of two different hardware types, and expect it to work. But that's not what we are trying to do. All we're trying to do is make use of the existing failover capability to reduce downtime. First there's a 6080->6080 takeover, then the system goes down completely, then there's a 6290<-6290 takeover, and then it's back to normal. Where's the unsupported bit?

> Environment variables are set that may be unique also. Been a while since I have checked.
>
> You need to be careful with disk assignments (software ownership). Networking interfaces may not line up and will require correction.

I'm aware of these issues; they are addressed in the standard HA pair headswap guide. It just amazes me that the standard guide doesn't give the option to minimize downtime by doing the extra failovers. Maybe that's because most HA configurations these days use one enclosure, so the heads cannot be taken from the rack separately? (That's obviously not the case for MetroCluster configurations, where the heads are in different cabinets by design.)

Or is there some information written to disk during a takeover that would prevent new hardware from picking things up? I find that hard to believe, because in case node 1 goes up in flames, it'll have to be replaced with new hardware anyway... so there should be a way to recover from that.

-- Jan-Pieter Cornet "Most seasonal greetings are sent by spammers and phishers."

tmac

2:39 p.m.

I completely understand. It is *not* a supported operation. There are *still* NVRAM handshakes that happen and by changing the head you *WILL* have major problems.

You have head two that has taken over head one. -> YOU HAVE A TAKEOVER OPERATION.

If you shut down and replace the heads, they will likely be expecting NVRAM behavior from the FAS6080s that are no longer there. You cannot change heads

Why not just:

(and I am sure this is very abbreviated from the upgrade guide: LOTS of missing steps!)

- cf disable (to kill takeover/giveback operations)
- halt -c on one head to shut it down without triggering a takeover
- replace the head you just shut down, start it back up, and make it work correctly (ignore mismatch errors for now)
- halt -c on the other head
- replace the head you just shut down, start it back up, and make it work correctly
- After you are sure everything is normal/OK -> cf enable to re-enable the cluster
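The abbreviated sequence above can be written down as an ordered checklist. The following is only a sketch in Python that prints the steps; it does not connect to a filer or run anything. The `Step` type, the `head_swap_checklist` function, and the head labels are made up for illustration, and the command strings are copied from the message rather than checked against the upgrade guide, which remains the authority and contains many more steps.

```python
# Checklist sketch of the abbreviated sequence above.
# It only builds and prints text; nothing is executed on any system.
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    where: str   # which head the step applies to (labels are illustrative)
    action: str  # command or manual action, as named in the message


def head_swap_checklist(heads=("head 1", "head 2")):
    """Return the steps in order: disable failover, swap each head, re-enable."""
    steps = [Step("either head", "cf disable")]
    for head in heads:
        steps.append(Step(head, "halt -c"))
        steps.append(Step(head, "replace the head, boot it, verify it works"))
    steps.append(Step("either head", "cf enable"))
    return steps


if __name__ == "__main__":
    for number, step in enumerate(head_swap_checklist(), start=1):
        print(f"{number}. [{step.where}] {step.action}")
```

Keeping `cf disable` first and `cf enable` last in the list mirrors the point of the message: no takeover or giveback is in flight while a head is being exchanged.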

Think about it, your downtime will be minimal. You are only taking down one head at a time.

Please follow the upgrade guides.

--tmac


Jan-Pieter Cornet

3:37 p.m.

On 2013-4-15 16:39, tmac wrote:

> I completely understand. It is *not* a supported operation.
>
> There are *still* NVRAM handshakes that happen and by changing the head you *WILL* have major problems.

The NVRAM of the old and new systems would never be connected... but apparently I don't understand enough about this functionality, so I'll just study the upgrade guide.

> Think about it, your downtime will be minimal. You are only taking down one head at a time.

Err, not really. If any service is unavailable, then our platform is down. Downtime would start when the first head goes down, and end when the last head goes up again. The nature of our system doesn't allow for a partial unavailability. In practice, that means a downtime of hours instead of minutes. If we cannot use takeover, we might as well replace both heads at the same time.
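The arithmetic behind this objection can be sketched in a few lines of Python. The platform counts as down whenever any service is unserved, so downtime is the sum of every phase in which at least one node's services are missing. The durations `SWAP_MINUTES` and `CUTOVER_MINUTES` are invented placeholders, not figures from this thread, and the service labels are hypothetical. The third schedule models the takeover-based plan only as it was proposed; whether that plan can work at all is exactly what the thread disputes.

```python
# Toy downtime model: the platform is "up" only while BOTH nodes' services
# are being served by some head. All durations are illustrative placeholders.
REQUIRED = frozenset({"node1-services", "node2-services"})

SWAP_MINUTES = 90     # assumed time to exchange and bring up one head
CUTOVER_MINUTES = 20  # assumed time with both heads down in the proposed plan


def platform_downtime(phases):
    """Sum the minutes of every phase in which some required service is missing."""
    return sum(
        minutes
        for minutes, served in phases
        if not REQUIRED <= frozenset(served)
    )


# One head at a time, failover disabled: each swap leaves one node's services down.
one_head_at_a_time = [
    (SWAP_MINUTES, {"node2-services"}),
    (SWAP_MINUTES, {"node1-services"}),
]

# Both heads replaced in parallel: everything is down for one swap period.
both_heads_at_once = [(SWAP_MINUTES, set())]

# The proposed (disputed) plan: a partner serves everything during each swap.
takeover_based = [
    (SWAP_MINUTES, REQUIRED),   # old node 2 serves all while node 1 is swapped
    (CUTOVER_MINUTES, set()),   # both down: new node 1 boots, disks reassigned
    (SWAP_MINUTES, REQUIRED),   # new node 1 serves all while node 2 is swapped
]

if __name__ == "__main__":
    print(platform_downtime(one_head_at_a_time))  # 180
    print(platform_downtime(both_heads_at_once))  # 90
    print(platform_downtime(takeover_based))      # 20
```

With these placeholder numbers, swapping one head at a time is the worst case for an all-or-nothing platform, which is why replacing both heads at once looks preferable once takeover is ruled out.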

-- Jan-Pieter Cornet "Most seasonal greetings are sent by spammers and phishers."

Sebastian Goetze

16 Apr

5:41 a.m.

Hi Jan,

Technically I follow you (though it's probably still unsupported), but there are 2 things:

* You'd reassign the disks the following way:
  o (6080 node 2 still running in takeover mode)
  o switch on new node1 (connected to disks already)
    + environment variables will have to be set for the 'old' node2 to be recognized as partner!
    + will be stuck on boot, since it doesn't find disks it owns. will tell you the new SYSID, though
    + if in a reboot-loop (instead of waiting for keypress), shut it down again
  o On node2 do
    + 'partner' to switch to virtual partner
    + 'disk reassign -d <new SYSID of new node1>'
    + 'partner' to switch back again
    + Shut down Node2 cleanly
  o Now new node1 will be able to boot successfully and, since (by now) old node2 should have released the disks, not be 'waiting for giveback'...
* But how are you planning to take over the downed node2?
  o In your case I could imagine 'cf forcetakeover -d'... (That's what you need the partner-ID environment vars for...)
    + ==> this is where I'm not sure if it works...
  o Then the dry-run on new node2 to find the new SYSID and set the partner ID (on node2)
  o Then the partner/disk reassign.../partner (on node1)
  o I'm pretty sure a 'cf giveback' probably doesn't work yet... But maybe
    + 'cf forcegiveback' on node1 to shut down virtual node2
    + boot new node2, it should now recognize the disks...
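The ownership bookkeeping in the first bullet can be pictured with a toy model. This is a sketch in Python, not ONTAP: `disk_reassign` below is an ordinary function that imitates the effect described for 'disk reassign -d', and the SYSIDs and disk names are made up. It shows why the new head finds no disks on its first boot and why it does after the reassignment.

```python
# Toy model of software disk ownership: a mapping from disk name to owner SYSID.
# All names and IDs are hypothetical; no real system is queried or changed.

def owned_by(ownership, sysid):
    """Return the sorted names of the disks currently owned by sysid."""
    return sorted(disk for disk, owner in ownership.items() if owner == sysid)


def disk_reassign(ownership, old_sysid, new_sysid):
    """Return a new mapping in which old_sysid's disks belong to new_sysid."""
    return {
        disk: (new_sysid if owner == old_sysid else owner)
        for disk, owner in ownership.items()
    }


if __name__ == "__main__":
    ownership = {
        "disk-a": "OLD-NODE1",
        "disk-b": "OLD-NODE1",
        "disk-c": "NODE2",
    }

    # First boot of the new head: it owns nothing yet, so it cannot come up.
    print(owned_by(ownership, "NEW-NODE1"))  # []

    # After the reassignment done from the partner, the new head owns the disks.
    ownership = disk_reassign(ownership, "OLD-NODE1", "NEW-NODE1")
    print(owned_by(ownership, "NEW-NODE1"))  # ['disk-a', 'disk-b']
    print(owned_by(ownership, "NODE2"))      # ['disk-c']
```

The model leaves out everything that makes the real procedure risky (NVRAM state, partner-ID environment variables, the takeover state itself), which is the part the other replies warn about.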

Main thing is to get through this without a panic... :-P

Sebastian


Borzenkov, Andrey

15 Apr

1:29 p.m.

In short - this will be a real disaster and you will need NetApp support to clean it up.

After "failover of node 1 to node 2" you will neither be able to "bring up the now-replaced node 1" nor to "takeover node 2 on the new node 1" if you ever manage to bring it up.

