Hey all,
I've got an 8.0.2 c-mode cluster that recently had a single node joined to it, with a few empty aggregates created on it. We had an extended power outage that required a lot of gear in the data center to be shut down, and since this node in the cluster didn't have any live data or VIFs on it, it got shut down.
A few days later, we're powering it back up, but get this upon login:
"The contents of the root volume may have changed and the local management databases may be out of sync with the replicated databases due to corruption of NVLOG data during takeover. This node is not fully operational. Contact support personnel for the root volume recovery procedures."
The node comes up fine, can see all its aggregates, and the other nodes in the cluster can see it via the cluster network, but the node is indeed not fully functional or back in the cluster. Its aggregates and other info are not visible from the other nodes in the cluster.
Did a wafl_check of the root aggr and vol0, and that came back clean. I seem to recall having been through this before, but I can't find anything in my notes.
This particular cluster is not under support, due to some genius decisions by management, so I'm on my own with this.
There are a few empty aggregates on this node and no volumes other than the root vol. Maybe I can force-unjoin it from the cluster and rebuild it? I'd rather not try that. If there is a way to sync up the DBs on the root vol so it will come back into the cluster, that would be ideal.
Any ideas?
Adding a single node? Um... Unsupported to start with.
8.2 only supports a single-node cluster, or up to 24 nodes added in pairs of the same model.
--tmac
*Tim McCarthy* *Principal Consultant*
Nodes get added one at a time.
This one has a partner; it just hadn't been added to the cluster yet.
Ouch... weird state
If there is truly nothing on that one node... I would try to unjoin it somehow.
What do the other nodes think?
--tmac
the other nodes see it, but say it's not healthy ;)
filer1a::cluster*> show
Node                 Health  Eligibility  Epsilon
-------------------- ------- ------------ ------------
filer1a              true    true         true
filer1b              true    true         false
filer2a              true    true         false
filer2b              true    true         false
filer3a              false   true         false
filer4a              true    true         false
filer4b              true    true         false
filer5a              true    true         false
filer5b              true    true         false
Not sure if a node that has aggregates (even if they are empty) can be unjoined or not. I can rig up a test for that, but it'll take a while.
I'm thinking there's some command (or commands) to tell it to pull over the RDB and whatever else it's upset about from the other nodes in the cluster, which I'm sure support would clue me in on (I see mention on the NetApp Communities forums of someone going through the same thing, and support sorted them out), but unfortunately I can't call them about this particular cluster.
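If unjoining did turn out to be the route, the sequence from one of the healthy nodes would presumably look something like the following. This is only a sketch and hasn't been tried in this broken state: the aggregate name is made up, filer3a is the unhealthy node from the output above, and the empty aggregates would probably have to be deleted before the unjoin is allowed.

    filer1a::> set -privilege advanced
    filer1a::*> storage aggregate delete -aggregate filer3a_aggr1
    filer1a::*> cluster unjoin -node filer3a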
try:
set -priv diag
volume add-other-volumes
which claims it's used to import 7-mode volumes, but I've used it to re-sync the cluster volume database when a root volume went walk-about.
*and*
also in priv mode:
volume lost-found show - to see what the cluster volume DB thinks might be missing.
These commands are in cDOT 8.2P3; not sure about earlier versions.
-skottie
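For reference, run from the clustershell of a healthy node, the session Scott describes would look roughly like this. A sketch only: filer3a is the unhealthy node from the cluster show output above, and the -node argument on add-other-volumes is an assumption.

    filer1a::> set -privilege diagnostic
    filer1a::*> volume add-other-volumes -node filer3a
    filer1a::*> volume lost-found show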
Hey Skottie!
Good to hear from you - been a while!
Unfortunately, though those commands are available in this ancient 8.0.2 release we are running on this cluster, they cannot be run in the state that this particular node is in.
It's in a state kind of like before joining or creating a cluster. There are no volume commands available to run (even if I explicitly type them out). It only has the cluster join, cluster create, and ping-cluster commands available - it's like it has orphaned itself from the rest of the cluster. Though if I run ping-cluster, it sees and can connect to all the other nodes fine.
I'm tempted to try and rejoin it to the cluster via 'cluster join' again, but don't want to possibly screw up the rest of the cluster.
Thanks for the help!
Hello Mike:
You might also need to do the following to clear the flags that caused the node to enter root volume recovery mode. I would highly suggest an upgrade.
--April
1. Check whether the bootarg.init.boot_recovery bit is set. From the FreeBSD prompt of your node, type:
   kenv bootarg.init.boot_recovery
2. If a value is returned (rather than "kenv: unable to get bootarg.init.boot_recovery"), clear the bit. From the FreeBSD prompt of your node, type:
   sudo sysctl kern.bootargs=--bootarg.init.boot_recovery
3. Check whether bootarg.rdb_corrupt.mgwd is set. From the FreeBSD prompt of your node, type:
   kenv bootarg.rdb_corrupt.mgwd
4. If "true" is returned (rather than "kenv: unable to get bootarg.rdb_corrupt.mgwd"), clear the bit. From the FreeBSD prompt of your node, type:
   sudo kenv bootarg.rdb_corrupt.mgwd="false"
5. Check whether the monitor_mroot.nvfail file exists. From the FreeBSD prompt of your node, type:
   ls /mroot/etc/cluster_config/monitor_mroot.nvfail
6. If the file exists (you don't get "No such file or directory"), remove it. From the FreeBSD prompt of your node, type:
   sudo rm /mroot/etc/cluster_config/monitor_mroot.nvfail
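Collapsed into one FreeBSD systemshell session on the affected node, April's checks amount to roughly the following sketch; each "clear" command only applies when the check right before it returns a value or finds the file, and a reboot follows at the end.

    kenv bootarg.init.boot_recovery
    sudo sysctl kern.bootargs=--bootarg.init.boot_recovery   # only if the kenv above returned a value
    kenv bootarg.rdb_corrupt.mgwd
    sudo kenv bootarg.rdb_corrupt.mgwd="false"                # only if the kenv above returned "true"
    ls /mroot/etc/cluster_config/monitor_mroot.nvfail
    sudo rm /mroot/etc/cluster_config/monitor_mroot.nvfail    # only if the file exists; reboot afterwards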
Hi April,
This last step, removing /mroot/etc/cluster_config/monitor_mroot.nvfail and a reboot did the trick!
All nodes/aggrs showing up properly, and syslog looks clean.
Thanks very very much!
Mike,
You should upgrade if possible.
In 8.1 there is a command
system configuration recovery cluster sync -node node2
which is used for synchronizing a node with a cluster.
Would have made life a lot simpler ☺
Duncan
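For anyone hitting this later on 8.1 or newer, the usage is roughly the following; a sketch only, with filer3a standing in for the affected node, and the command may require advanced privilege depending on the release.

    filer1a::> set -privilege advanced
    filer1a::*> system configuration recovery cluster sync -node filer3a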
Hello Mike:
Glad to hear you got it working. It was hard to know exactly what you needed, but at least the earlier commands didn't cause any harm.
Definitely worth upgrading to DOT 8.2 if you can.
--April
Oh neat, didn't realise odd-numbered clusters were now supported!
Andy
Only a single-node cluster, and only in 8.2 (and forward). You can go with a switchless HA pair too.
After that, nodes must be installed in pairs and the pairs must be the same filer model.
--tmac