Hi all,
We're in the middle of hell, where our 4-node FAS8060 cluster was shut down cleanly for a move, but only one pair made it onto the truck to the new DC. Luckily I have all the volumes snapmirrored between the two pairs of nodes and their aggregates.
But now I need to bring up the pair that made the trip, figure out which mirrors are source and which are destination on this pair, and then break the destination ones so I can promote them to read-write.
This is not something I've practiced, and I wonder: if I have volume foo, mounted at /foo, and its snapmirror is volume foo_sm, when I do the break, will it automatically mount at /foo? I guess I'll find out later tonight, and I can just unmount and remount.
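For reference (and so folks can sanity-check me), here's roughly the sequence I have in mind, assuming ONTAP 9.x clustershell and a made-up SVM name vs1:

    snapmirror show -fields source-path,destination-path,state,status
        (which relationships this pair holds as destination, and their state)
    snapmirror quiesce -destination-path vs1:foo_sm
    snapmirror break -destination-path vs1:foo_sm
        (makes foo_sm read-write; as far as I know it does NOT touch junction
        paths, so nothing gets mounted at /foo automatically)
    volume unmount -vserver vs1 -volume foo
    volume mount -vserver vs1 -volume foo_sm -junction-path /foo
        (assuming the cluster will let me touch the namespace entry for a
        volume whose owning nodes are offline, which is where I may get stuck)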
I think this is all good with just a simple 'snapmirror break ...', but then when we get the chance to rejoin the other two nodes into the cluster down the line, I would assume I just have to (maybe) wipe the old nodes and rejoin them one at a time. Mostly because by that point I can't have the original source volumes come up and cause us to lose all the writes that have happened on the now-writable destination volumes.
And of course there's the matter of getting epsilon back up and working on the two node cluster when I reboot it. Along with all the LIFs, etc. Not going to be a fun time. Not at all...
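My rough checklist for the two-node remnant once it boots (cluster ring show and the epsilon column need advanced privilege, if I remember right):

    set -privilege advanced
    cluster show
        (node health, eligibility, and which node holds epsilon)
    cluster ring show
        (each RDB ring -- mgmt, vldb, vifmgr, bcomd -- should show a master,
        i.e. the cluster is in quorum)
    set -privilege admin
    network interface show -is-home false
        (any LIFs stranded away from their home node/port)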
And of course we're out of support with Netapp. Sigh...
And who knows whether the pair that came down lost some disks in transit and will end up losing one or more aggregates as well. Stressful times for sure.
So I'm just venting here, but any suggestions or tricks would be helpful.
And of course I'm not sure if the cluster switches made it down here yet.
Never put your DC on the second floor if there isn't a second freight elevator. Or elevator in general. Sigh...
John
John,
If it were me, I would take it one step at a time: first get the systems physically in place, cabled up, and the cluster switches up and running, and once you can log in on the console or over SSH, then start looking at the aggregates, LIFs, cluster epsilon/quorum, which volumes are source, etc. Keep me posted if you hit any major roadblocks and I can try to help out.
Thank you, Tim
Hi Tim, Thanks for the offer to help. We will just have to see how long it will take to reunite our cluster pairs, though I’d be happy to just dump the missing pair and handle the data loss.
Which won’t be much if anything at all.
Sent from my iPhone
You can only bring up two nodes if one of them has epsilon. Otherwise no configuration changes are possible, and that includes breaking SnapMirror.
If those two nodes do have epsilon, it is just the normal procedure to fail over the LIFs. After a SnapMirror break, destination volumes are not renamed; they remain exactly as they are. Nothing special needs to be done when the two remaining nodes arrive; they just join the cluster normally.
If these two nodes do not have epsilon, your best bet is to wait for the remaining cluster nodes to arrive. I am not aware of a way to force epsilon in this case. Maybe one exists, but you would certainly need a support case to obtain it and any follow-up steps.
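Assuming the surviving pair does hold epsilon, the LIF part is usually just checking what failed over and, for any data LIF whose home node is on the missing pair, re-homing it onto a surviving node. A sketch with made-up LIF, node and port names:

    network interface show -vserver vs1
    network interface modify -vserver vs1 -lif nfs_lif1 -home-node nodeA -home-port e0e
    network interface revert -vserver vs1 -lif nfs_lif1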
The two nodes came up OK; the problem is the SVMs with a root vol on the missing pair of nodes. Escalating with NetApp.
Not sure we have the time to wait for the elevator repair to bring the missing pair to the new location, ten hours away by truck. It's been a hellish weekend.
I wonder if load-sharing the rootvols would have saved us some pain? Maybe not, since updates wouldn't be possible.
Sent from my iPhone
Well, this is the one case where Load-Sharing Mirrors would have been helpful...
Too late now, though.
Sebastian
Sebastian> Well, this is the one case where Load-Sharing Mirrors would have been helpful... Too late now, though.
Maybe... it's not clear if it would have helped all that much. What we should have done in hindsight, since we had the capacity, would be to split off two nodes of the cluster ahead of the move, then snapmirror between them.
It still would have been a pain, but recovery would have been simpler. We're still working on issues, but most things are working.
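For the archives, roughly what that would have looked like: two separate clusters, peered, with mirrors between them (names made up, ONTAP 9.x syntax from memory, run from the destination side):

    cluster peer create -peer-addrs 10.0.0.50
        (plus the matching cluster peer create on the other cluster)
    vserver peer create -vserver vsA -peer-vserver vsB -peer-cluster clusterB -applications snapmirror
    volume create -vserver vsA -volume foo_sm -aggregate aggr1 -size 100g -type DP
    snapmirror create -source-path vsB:foo -destination-path vsA:foo_sm -type XDP -policy MirrorAllSnapshots
    snapmirror initialize -destination-path vsA:foo_sm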
John
I had something like this a while back.
I ended up just creating a new 1GB volume and then converting it to the root vol (volume make-vsroot).
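Roughly (the SVM and aggregate names here are placeholders):

    volume create -vserver vs1 -volume vs1_root_new -aggregate aggr1_node1 -size 1g
    volume make-vsroot -vserver vs1 -volume vs1_root_new

If I remember right, the data volumes' junctions then have to be re-mounted into the new (empty) root with volume unmount / volume mount.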
--tmac
Tim McCarthy, Principal Consultant
Proud Member of the #NetAppATeam https://twitter.com/NetAppATeam
I Blog at TMACsRack https://tmacsrack.wordpress.com/
"tmac" == tmac tmacmd@gmail.com writes:
tmac> I had something like this a while back.
tmac> I ended up just creating a new 1GB volume and then converting it to the root vol (volume make-vsroot).
Since we have a snapmirror of the old rootvol, I was able to promote it once I had broken the copy. So we're ok now, though still fighting other issues.
John
So the NetApp came up without any major problems, and I can see all the volumes on the two nodes that arrived. But now I need to break the snapmirror relationships (done), unmount the original volumes from their junction paths, and mount the mirrors onto those paths. That's not working well, since the two nodes holding the original volumes are offline.
I suspect I will have to just delete those nodes from the cluster, which is scary, and accept that when those nodes finally make it down here I will have to initialize them and re-add them to the cluster.
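If it comes to that, my understanding is it's the unjoin path, run at advanced privilege from a surviving node (cluster unjoin on older ONTAP, cluster remove-node on newer releases), and the removed nodes have to be wiped (boot menu option 4, initialize all disks) before they can ever rejoin:

    set -privilege advanced
    cluster unjoin -node node3
    cluster unjoin -node node4

Whether it will even let me remove nodes that are unreachable is another question; that may well end up being a support-guided procedure.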
John
Sent from my iPhone