Guys,
We're getting ready to upgrade our 4-node 8060 cluster from 8.3.2P9 to 9.1P19 and then on to 9.3P17, all this weekend. My only real concern is the upgrade from 9.1 to 9.3, which lists a major warning for bug 1250500:
Expired truststore security certificates causing upgrade and new installation failures.
Unfortunately, I can't run an Upgrade Advisor report for my cluster going from 9.1P19 to 9.3P17, because I'm not yet running 9.1, and it can take up to a week for the autosupport data to get pushed to Upgrade Advisor. Sigh...
Has anyone run into this issue when doing the 9.1 -> 9.3 upgrade with the expired certificate? Otherwise, it all looks good, my cluster switches are supported at their current version, etc.
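For what it's worth, the only pre-check I can think of doing by hand is to eyeball the certificate expiration dates before each hop, roughly like this (assuming the standard security certificate show output; I haven't confirmed with support that this is the exact check for 1250500):

    ::> security certificate show

and scan the Expiration Date column for anything already expired, especially the pre-installed root/server-ca certificates.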
John
Is 9.4 supported on the 8060? I think later versions of 9.4 avoid that cert expiration issue.
Kevin
Kevin> Is 9.4 supported on the 8060? I think later versions of 9.4 avoid that cert expiration issue.
It looks like OnTap is still supported on the FAS8060s; the Hardware Universe tool seems to show support all the way up to 9.7, which I doubt we'll ever get to before we retire this cluster and move to something newer.
John,
Did you have a look at fastpath? https://whyistheinternetbroken.wordpress.com/2018/02/16/ipfastpath-ontap92/
It seems every time we put in a case for upgrades to 9.3, NetApp support tries to make sure we looked into this! So it must've bitten a lot of folks.
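The pre-check we run for it is just the old nodeshell option, something like this (option name from memory, so double-check it on your release):

    ::> system node run -node * -command "options ip.fastpath.enable"

If that comes back on and you rely on asymmetric routing anywhere, sort out the routes before you land on a release where fast path is gone.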
We did run into a pretty big bug on the upgrade from 9.1 to 9.3P15 -- we have a case/core in now. I've seen NFS stop serving from a node in at least 3 clusters roughly 2-5 hours after the upgrade. We fix it by identifying the unresponsive node and either powering it down or resetting it via NMI/SP. It will not respond to normal takeover commands. Preliminary core analysis (no full core analysis yet) points at at least 1 bug fixed in 9.3P17.
https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1236722
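When it hits, since the node ignores normal takeover, we go in through the SP, roughly like this (SP command names from memory, so verify them on your SP firmware):

    ssh admin@<node-sp-address>
    SP> system core          (force an NMI/core dump so support has something to analyze)
    SP> system power off
    SP> system power on

The partner takes over once the node drops.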
Typically, when we roll updates it can take months given the number of nodes and clusters. So we stick with whatever P patch we rolled onto the first set of nodes, and then by the end of the upgrades 1-3 more P patches have been released.
With this experience, always use the latest P patch possible on the intermediary update, especially if you are going to take a bit to roll it through your entire deployment. I also recommend taking a look at going to 9.5; it sounds nuts, but we've had better stability with this release. We moved to this release because of a specific feature that was needed (CIFS/SMB enhancements, and flexcache/flexgroups).
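The roll itself is just the standard automated update for us, roughly (the web server URL and image file name below are placeholders):

    ::> cluster image package get -url http://<webserver>/93P17_q_image.tgz
    ::> cluster image validate -version 9.3P17
    ::> cluster image update -version 9.3P17
    ::> cluster image show-update-progress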
Regards, Douglas
Douglas> Did you have a look at fastpath?
I did, and I think I'm all ok because most of my SVMs have just one network associated with them, and I've also moved all my management off to a completely separate subnet.
Our cluster is stupid simple, just 4 x 10gb LACP trunks from each node (4 nodes total) with a number of VLANs running over those trunks. The SVMs are reasonably designed, though we did make some mistakes many years ago when we first set things up that I would do differently now.
Douglas> https://whyistheinternetbroken.wordpress.com/2018/02/16/ipfastpath-ontap92/
Douglas> It seems every time we put in a case for upgrades to 9.3, NetApp support tries to make sure we looked into this! So it must've bitten a lot of folks.
I think so. I hope we're all set.
Douglas> We did run into a pretty big bug on the upgrade from 9.1 to 9.3P15 -- we have a case/core in now. I've seen NFS stop serving from a node in at least 3 clusters roughly 2-5 hours after the upgrade. We fix it by identifying the unresponsive node and either powering it down or resetting it via NMI/SP. It will not respond to normal takeover commands. Preliminary core analysis (no full core analysis yet) points at at least 1 bug fixed in 9.3P17.
Yikes! That is a big bug to have to deal with.
Douglas> https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1236722
Douglas> Typically, when we roll updates it can take months given the number of nodes and clusters. So we stick with whatever P patch we rolled onto the first set of nodes, and then by the end of the upgrades 1-3 more P patches have been released.
You have a much bigger environment than we have! I think next time we might do smaller pairs of nodes, so we can just VMotion VMs back and forth and hopefully have enough space to be a bit more proactive on upgrades without disrupting everything with a full shutdown.
Douglas> With this experience, always use the latest P patch possible on the intermediary update, especially if you are going to take a bit to roll it through your entire deployment. I also recommend taking a look at going to 9.5; it sounds nuts, but we've had better stability with this release. We moved to this release because of a specific feature that was needed (CIFS/SMB enhancements, and flexcache/flexgroups).
I had thought of going that far up, but just getting the downtime for the two jumps I need to do has been hard enough. But we're learning our lesson and trying to do upgrades more frequently. We only have this one cluster though and it runs everything.
John,
You will be fine in performing that upgrade from 9.1P19 to 9.3P17 (I’d recommend going to the latest 9.3P18 release). Regarding the bug around the security certificates, you need to be at 9.1P14 or higher prior to performing the upgrade to 9.3.
There are some other items that I would recommend that you check in your environment. I believe I saw in another email reply the note about Fastpath. Definitely do your homework on that one.
Also, and this is very important, make sure that your SP firmware is up to date for the releases that you are on/going to. I ran into this issue 3 times last week during an upgrade where the SP firmware wasn't up to date and the controllers, when rebooting during the ONTAP upgrade, halted and returned to the loader prompt with the error "This platform is not supported in this release".
I was able to resolve it via the SP command line by performing a "dirty shutdown" of that node, powering it back up, and then performing an SP reboot. There is a NetApp KB article, 1009154, that is related (it talks about a different platform) but the fix resolves this.
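A quick way to check the SP firmware beforehand, and kick an update if needed, is roughly this (field and parameter names from memory, so verify them on your release):

    ::> system service-processor show -fields firmware-version,status
    ::> system service-processor image update -node <node>

The SP will normally also update itself from the new ONTAP image after the upgrade, but you want it current before the reboots start.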
Also, I *highly* recommend updating all disk/shelf firmware and qualification files ahead of the game (at least 24 hours).
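For the disk side, the quick look I take beforehand is something like this (the firmware-revision field name is from memory):

    ::> storage disk show -fields firmware-revision
    ::> system node run -node * -command "sysconfig -a"

sysconfig -a also lists the shelf and module firmware; the firmware and disk qualification packages themselves go on per the usual NetApp download page instructions.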
Good luck and HTH
Regards, André M. Clark
André> You will be fine in performing that upgrade from 9.1P19 to 9.3P17 (I’d recommend going to the latest 9.3P18 release). Regarding the bug around the security certificates, you need to be at 9.1P14 or higher prior to performing the upgrade to 9.3.
That's good then, since we're going to 9.1P19.
André> There are some other items that I would recommend that you check in your environment. I believe I saw in another email reply the note about Fastpath. Definitely do your homework on that one.
I think we're all set there.
André> Also, and this is very important, make sure that your SP firmware is up to date for the releases that you are on/going to. I ran into this issue 3 times last week during an upgrade where the SP firmware wasn't up to date and the controllers, when rebooting during the ONTAP upgrade, halted and returned to the loader prompt with the error "This platform is not supported in this release".
This is a good thing to know; mine are all at SP version 3.1.2. I'll see if there's a newer version and plan on upgrading them all ahead of time if I can. Though I might not, since 3.1.2 is the latest version supported with OnTap 8.3P2, and 9.1P19 will give me up to SP version 3.9 to install.
André> I was able to resolve it via the SP command line by performing a "dirty shutdown" of that node, powering it back up, and then performing an SP reboot. There is a NetApp KB article, 1009154, that is related (it talks about a different platform) but the fix resolves this.
Thanks for this info, I'll certainly look into this ASAP.
André> Also, I highly recommend updating all disk/shelf firmware and qualification files ahead of the game (at least 24 hours).
Great idea, I can start this tonight I think.
Well, just to let you all know that the upgrade went great, and I even did the CN1610 cluster switch upgrade as well, even though the version I was on was still supported with 9.3P17.
The only real gotcha that hit me was having both regular and e0M ports in the same broadcast domain, which, since the node_management ports were on e0M, made it a *pain* in the ass to get them moved out.
I basically had to do:
- create new node_mgmt interface on another VLAN port
- delete the one I really wanted
- remove the e0M port from the broadcast-domain
- create a new failover group and add e0M to it
- recreate a new node management lif using e0M again
- delete other temp lif
Across four nodes, this sucked, especially having to figure it out. Then I found the "broadcast-domain split" command, which might have made this all oh so much easier.
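For the archives, it looks like the split would have collapsed most of those steps into a single command per broadcast domain, something like this (the domain and port names here are made up for illustration):

    ::> network port broadcast-domain split -ipspace Default -broadcast-domain Default -new-broadcast-domain Mgmt -ports node1:e0M,node2:e0M,node3:e0M,node4:e0M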
Anyway, the actual upgrade went great, no outage for the VMs still on the ESX cluster, etc. Very very nice. Next time I'll ask them if I can just do this during the day instead. LOL!
John
Oh No!
You should have asked!
There is a broadcast-domain merge and a broadcast-domain split for just such occasions!
--tmac
Tim McCarthy, Principal Consultant
Proud Member of the #NetAppATeam https://twitter.com/NetAppATeam
I Blog at TMACsRack https://tmacsrack.wordpress.com/
I did the same thing a while ago and then gave myself the forehead slap after I found out. The Split/Merge is so quick and easy.
I know what a PITA it was to do what you did!
--tmac
Tim McCarthy, Principal Consultant
Proud Member of the #NetAppATeam https://twitter.com/NetAppATeam
I Blog at TMACsRack https://tmacsrack.wordpress.com/
tmac> I did the same thing a while ago and then gave myself the forehead slap after I found out. The Split/Merge is so quick and easy.
tmac> I know what a PITA it was to do what you did!
It totally was a pain, and I think I've still got some issues, but doing the work by hand was a good learning experience. But still painful. Heh.