Hello All,
We just had a rough night trying to upgrade a 3020 from 7.2.4 to 7.3.7, where the upgraded node came up with no working network connections. It turns out the same setup / RC file that works great in 7.2.4 crumps in 7.3.7. After some flailing, we used revert_to to roll back. In the meantime, our VMware farm had several hundred VMs suffer a ~1 hour loss of their disks, and ~40 VMs needed help ranging from a simple reboot to a full rollback of the disk image.
This Sunday we're scheduled to upgrade a v3170 pair that supports about three times as many VMs and uses the same network design. We have some reason to hope the same issue won't bite us, better options for testing before we risk the virtual farm, and some other ideas for mitigating risk and impact. But it was also suggested that an even better way to avoid a problem would be to upgrade a node and then test the RC file before it was actually "given back."
If I do a takeover from Node B, then upgrade Node A, is there a way to bring up Node A far enough to test that its RC file loaded successfully but before it tries to go back into production?
Hope to hear from you,
Randy
Randy,
The easiest way is to look at your /etc/rc file before the upgrade and determine what needs to change or what is causing the issue. Once the node is failed back, you can modify your interfaces/vlans interactively (copy and paste the commands). Then make the changes stick by editing the rc file.
Depending on your config, this might be quite simple. I don't know of many changes that break the network config between 7.2 and 7.3 -- ifgrp versus vif, maybe? I thought those were deprecated but still work.
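For illustration, here's roughly what that rename looks like in an /etc/rc (vif names and ports below are made up, not from Randy's config; vif is the 7.x command, ifgrp the 8.x spelling, and my understanding is the old vif syntax is still accepted as deprecated):

```
# 7.x syntax: create an LACP vif and give it an address
vif create lacp vif1 -b ip e0a e0b
ifconfig vif1 192.0.2.10 netmask 255.255.255.0 partner vif1

# 8.x spelling of the same thing after the vif -> ifgrp rename
ifgrp create lacp vif1 -b ip e0a e0b
```

If the rc file already works in 7.2.4, the commands themselves may not be the problem; it could be an ordering or timing change in how the rc file is processed.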
-----Original Message----- From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Randy Rue Sent: Monday, October 22, 2012 5:21 PM To: toasters@teaparty.net Subject: test an upgrade without fully failing back an HA node?
_______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
Did someone happen, by some unlucky chance, to modify the network connections with FilerView or System Manager?
I know there have been many bugs, most of which blow away the network config in the /etc/rc file....
I recommend modifying the /etc/rc network info via a text editor...
I have had more than one customer have his filer reboot and not come up on the network, only to find out the network config was blown out by FilerView.
Other than that... were the release notes read? Without looking at that specific release, I know that at some point NetApp changed some details in the way they reference VIFs and specified a certain order of things for the /etc/rc file.
--tmac Tim McCarthy Principal Consultant
RedHat Certified Engineer 804006984323821 (RHEL4) 805007643429572 (RHEL5)
On Mon, Oct 22, 2012 at 5:21 PM, Randy Rue rrue@fhcrc.org wrote:
On Mon, Oct 22, 2012 at 2:21 PM, Randy Rue rrue@fhcrc.org wrote:
Hello All,
Turns out the same setup / RC file that works great in 7.2.4 crumps in 7.3.7. After some flailing, we used revert_to to roll back.
Why did you have to roll back ONTAP just because the rc file wasn't loading?
You can always source the file to see what's failing and modify it accordingly. In one of our upgrades, the rc file was missing (not sure why). All we had to do was rdfile /etc/.snapshot/hourly.0/rc, copy and paste it into wrfile, and then source it. HTH
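For the archive, that recovery is just a few console commands (the snapshot name here follows the default hourly schedule; adjust to whatever snapshot still holds a good copy):

```
rdfile /etc/.snapshot/hourly.0/rc   # print the last good copy of the rc file
wrfile /etc/rc                      # paste the contents, then Ctrl-C to finish
source /etc/rc                      # replay it and watch which line fails
```

Note that wrfile overwrites the target file, so paste the whole thing before hitting Ctrl-C.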
Our network configuration is slightly weird: we build four dynamic multimode VIFs (DMMVIFs), each an LACP-bonded group containing a single physical interface. Those are then bonded into two single-mode VIFs (SVIFs), which carry the actual IP addresses. This design was arrived at after a lot of deliberation by our network group and NetApp technical staff to address concerns with NetApp's LACP implementation back in the 7.2.4 era. My understanding is that it was intended to protect us from situations where an interface still had link but no traffic flowing, e.g. a failed supervisor card in the switch.
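In /etc/rc terms, my understanding of that layering looks roughly like this (interface names and addresses are illustrative only, not our actual config):

```
# four dynamic multimode (LACP) vifs, each over a single physical port
vif create lacp dmm1 -b ip e0a
vif create lacp dmm2 -b ip e0b
vif create lacp dmm3 -b ip e0c
vif create lacp dmm4 -b ip e0d

# two single-mode vifs layered on top; these carry the IP addresses
vif create single svif1 dmm1 dmm2
vif create single svif2 dmm3 dmm4
ifconfig svif1 192.0.2.10 netmask 255.255.255.0 partner svif1
ifconfig svif2 192.0.2.11 netmask 255.255.255.0 partner svif2
```

The point of the single-interface LACP groups is that the LACP keepalives act as a liveness check on each link, while the single-mode layer on top handles the actual failover.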
Changing that configuration is probably a good idea but will need to involve our network architects, our change approval board, thorough testing and some advance notice for the outage.
In last Saturday's failed upgrade the RC file was still there; it just failed to finish building the SVIFs. I wasn't willing to fling out a new setup on the fly, so my first option was to roll back until we could take a thorough look at the problem.
So. We're scheduled to try a 7.3.5.1P5 to 8.1.1 upgrade on a v3170 this weekend. That 3170 uses the same network setup. One of the questions I've been asked is whether we can fail over from A to B, upgrade A, and then bring A up far enough to see that its network setup is running BEFORE we actually give back. Can we test the network without putting production services back on a possibly broken node?
Can anyone help me with that question?
Randy
----- Original Message -----
From: "Sto Rage©" netbacker@gmail.com To: "Randy Rue" rrue@fhcrc.org Cc: toasters@teaparty.net Sent: Monday, October 22, 2012 4:12:34 PM Subject: Re: test an upgrade without fully failing back an HA node?
When you fail over to node B, you will be running the old OS and the old rc file. Until you give back, you are not running the new OS with either the old or the new rc file -- unless there is a boot option from maintenance mode? Sent from my Verizon Wireless BlackBerry
-----Original Message----- From: Randy Rue rrue@fhcrc.org Sender: toasters-bounces@teaparty.net Date: Tue, 23 Oct 2012 08:24:22 To: Sto Rage©netbacker@gmail.com Cc: toasters@teaparty.net Subject: Re: test an upgrade without fully failing back an HA node?
Hi Randy,
there's _no_ way to start up a node that's 'waiting for giveback'. To actually start, it would have to grab the disks of the root aggr/vol back from its partner.
How about checking the network config in a simulator? Just download the 8.1.1 Sim-VM and check your /etc/rc (maybe with different IPs...).
There's also a 7.3.6 (?) sim (I don't know if that's the latest one in the 7.3 line). You'll just need a (possibly virtual) *NIX host to run that one on.
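The check itself is quick once the sim is booted (substitute lab IPs into your copy of the rc first):

```
wrfile /etc/rc    # paste the production rc (with adjusted IPs), Ctrl-C to end
source /etc/rc    # if the syntax is the issue, the same vif/ifconfig lines
                  # that failed on the 7.3.7 node should fail the same way here
```

That won't prove the physical switch side of things, but it should at least catch a syntax or ordering problem in the rc file itself.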
Hope that helps
Sebastian
On 23.10.2012 17:24, Randy Rue wrote:
Randy,
I don't think it's possible to do what you're talking about, since the taken-over filer's MAC and IP addresses are going to be in use by its partner to service the storage requests. IIRC, after reboot, the taken-over filer brings its interfaces up to Layer 2 functionality, but no IP configuration is done, to avoid polluting ARP records and causing weird behavior in the networking stack.
But I could be wrong...
cheers,
- -=Tom Nail
On Tue, 23 Oct 2012 08:24:22 -0700 Randy Rue rrue@fhcrc.org wrote: