Hi everybody, last Friday, during an operation described as non-disruptive (NDO), we had a service interruption on the NAS side. We had to move the root aggregate from some old disks to new ones, and we followed to the letter the procedure reported here (our cDOT is 8.3.2P9 on a 4-node cluster):
https://kb.netapp.com/app/answers/answer_view/a_id/1030179
Put very simply, it says:
A. Check for epsilon on the node you have to migrate and move it to another node.
A.1 There is a warning about SAN protocol interruptions, but we did NOT have SAN protocols running, only NFS/CIFS.
B. LIF migration after the aggregate relocation.
Well, NFS was restarted and all the servers and apps depending on it went down! I'll let you imagine the customer's reaction... Also, after this command:
    system node modify -node node01 -eligibility false
the console gave us a warning about SAN disruption. As I wrote, that did not concern us.
Only afterwards did we find the following in the manual. As usual, manuals are less up to date than the knowledge base, so it is the last place one would look for fresh information!
https://library.netapp.com/ecmdocs/ECMP1367947/html/GUID-AB52F821-3A25-4E02-...
Moving epsilon for certain manually initiated takeovers
Note: Although cluster formation voting can be modified by using the cluster modify -eligibility false command, you should avoid this except for situations such as restoring the node configuration or prolonged node maintenance. If you set a node to be ineligible, it stops serving SAN data until the node is reset to eligible and rebooted. NAS data access to the node might also be affected when the node is ineligible.
And what does "might be" mean? I translate that as "nobody knows, try it..."
Now the most important question (we still have to migrate the other three nodes!) is this: assuming we have understood correctly that the order should be 1. migrate the LIFs and only then 2. move epsilon and set eligibility to false, is there an official answer or document with up-to-date information confirming that this is the right procedure to avoid interrupting the NAS protocols as well?
Thank you very much,
Dott. Giacomo Milazzo Senior Consultant & Technical Account Manager mobile: +39 340.6001045 @-mail: g.milazzo@sinergy.it Web: http://www.sinergy.it
SINERGY SpA Viale dei Santi Pietro e Paolo 50 00144 - Roma RM Tel. +39 06 44243674 Fax +39 06 44245272
Hmmmmm,
Going through the steps in the KB, I would have done the epsilon and eligibility steps (Step 1 in the KB) right before the reboot (Step 8), *after* moving the aggregate and the LIFs away from the node to be worked on.
At that point in time it shouldn't disturb anything, since no user traffic should pass through this node's interfaces or disks.
What do you think? (I'm a little unclear about the meaning of "NFS was restarted", but I have a feeling the above change in sequence should help)
Also, if you look at the revert steps, the KB first restores eligibility and HA failover and only at the end reverts aggregates and LIFs.
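To make the ordering argument concrete, here is a minimal Python sketch. It is a toy model and not an ONTAP API: the `Node` class, the step functions, and the names `aggr1`, `nfs_lif1` and `cifs_lif1` are all hypothetical. It also encodes only the assumption made above, namely that marking a node ineligible is risky for NAS while the node still hosts data aggregates or data LIFs; that behaviour is not confirmed by NetApp in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of one cluster node (hypothetical, not an ONTAP object)."""
    name: str
    data_aggregates: set = field(default_factory=set)
    data_lifs: set = field(default_factory=set)
    holds_epsilon: bool = False
    eligible: bool = True


# Each step only changes the toy state; real commands are not issued.
def relocate_aggregates(node):
    node.data_aggregates.clear()


def migrate_lifs(node):
    node.data_lifs.clear()


def move_epsilon(node):
    node.holds_epsilon = False


def set_ineligible(node):
    node.eligible = False


def check_order(node, steps):
    """Replay the steps and return the names of those that would run
    while the node still hosts data aggregates or data LIFs.
    Assumption: only 'set_ineligible' is risky in that situation."""
    risky = []
    for step in steps:
        still_serving = bool(node.data_aggregates or node.data_lifs)
        if step is set_ineligible and still_serving:
            risky.append(step.__name__)
        step(node)
    return risky


def make_node():
    # Hypothetical starting state of the node to be worked on.
    return Node("node01", {"aggr1"}, {"nfs_lif1", "cifs_lif1"}, holds_epsilon=True)


# Order as followed from the KB: epsilon/eligibility first.
kb_order = [move_epsilon, set_ineligible, relocate_aggregates, migrate_lifs]
# Order suggested in this reply: evacuate the node first.
suggested_order = [relocate_aggregates, migrate_lifs, move_epsilon, set_ineligible]

print(check_order(make_node(), kb_order))         # ['set_ineligible']
print(check_order(make_node(), suggested_order))  # []
```

The sketch only shows that, in the suggested order, the eligibility change happens once nothing user-facing is left on the node. Whether that is enough to avoid a NAS interruption on a real cluster still needs confirmation from NetApp.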
Regards
Sebastian
On Mon, Mar 12, 2018, 18:39 Milazzo Giacomo G.Milazzo@sinergy.it wrote:
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
Hi Sebastian,
Thanks for the answer. How are you? A long time has passed since the cDOT course we attended together in Wien. :-)
So it seems you agree with me that the KB sequence is wrong (there is also another mistake, where it says to revert the LIFs, but that one is easy to spot).
By "NFS restart" I mean an NFS stop/start cycle that caused the loss of communication.
Indeed, the revert steps follow the order you're suggesting.
By the way, that KB gives inadequate information and should be corrected ASAP: either it fails to mention the NAS interruption or the sequence is wrong. Also, the prompt of the console command does not warn about NAS.
Lastly, the customer would like an official response about this sequence. Do you think there might be some documentation?
The customer will accept another risk only if the steps are "certified". Otherwise they will plan an application stop to avoid disruptions.
Regards
Sent by Mobile
On 12 Mar 2018 20:39, "Sebastian P. Goetze" spgoetze@gmail.com wrote: