Thanks for sharing this info! This will hopefully help someone else down the line.
Jan-Pieter> Following up to my own question to recap. TL;DR: It worked! We added 9.2 nodes to the 9.1 cluster, moved everything, then kicked off the 9.1 nodes.
Jan-Pieter> On 27-11-17 16:46, Jan-Pieter Cornet wrote:
We have a FAS3250 that's primarily backup storage.[...] However... the FAS3250 hardware can only run ONTAP 9.1. Newer versions of ONTAP are not available on that hardware. And for the 8200 nodes to join the existing cluster, they have to run the same software version (9.1), so... they cannot be initialized with ADP.
And you can't initialize first and partition later, because repartitioning wipes all existing data.
Jan-Pieter> Back to our starting position... we started out with a 2-node 3250 cluster running 9.1, and just added two 8200 nodes also running 9.1(P7).
Jan-Pieter> That configuration wasted 6 disks on root aggrs, which is a lot on 8T disks (roughly 48T of raw capacity).
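For reference, you can see what dedicated root aggregates cost you in whole disks with something like this (assuming the default aggr0_<nodename> naming; adjust to your own names):

    ::> storage aggregate show -aggregate aggr0*
    ::> storage aggregate show-status -aggregate aggr0_fas8200_01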
Jan-Pieter> The 4 nodes were already connected using the proper switches (which we hired from netapp specifically for the upgrade).
Jan-Pieter> After a few failed attempts at getting the 9.1 nodes to partition disks manually (which is possible in maintenance mode) and then installing an OS on them (which seems impossible on 9.1), we started by kicking the 8200s out of the cluster and (without disconnecting the switches! :) creating a new 2-node cluster from the 8200s.
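If memory serves, the kick-out-and-recreate part is roughly this (node names made up, and the usual storage-failover/epsilon housekeeping comes first):

    ::> cluster unjoin -node fas8200-01
    ::> cluster unjoin -node fas8200-02
    (then, on the console of the first 8200, run the setup wizard)
    ::> cluster setup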
Jan-Pieter> The machines are physically located at a remote site and we didn't want to drive up there too often just to fiddle with cluster interconnects, so we didn't. Apparently having 2 clusters share cluster switches works (but is likely unsupported).
Jan-Pieter> We upgraded the 8200 nodes to 9.2P1, and then re-initialised them using option "9" on the boot menu, creating root partitions. That part was relatively easy.
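For the record, the partitioned re-init is done from the boot menu; quoting from memory, the relevant bits of the option-9 sub-menu look roughly like this on 9.2 (exact wording differs per release):

    Selection (1-9)?  9
    ...
    (9a) Unpartition all disks and remove their ownership information.
    (9b) Clean configuration and initialize node with partitioned disks.
    ...
    Selection (9a-9e)?  9b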
Jan-Pieter> Next, we tore down that cluster again and joined the 8200 nodes, now running 9.2P1, to the 3250 cluster. From then on, whenever you log in, you get a notice saying:
Warning: The cluster is in a mixed version state. Update all of the nodes to the same version as soon as possible.
Jan-Pieter> Or in other words: here be dragons. And we did find some.
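In that mixed-version state the cluster-shell "version" output can be misleading; to see what each node is really running, something like this helps (NODENAME is a placeholder):

    ::> system node image show
    ::> node run -node NODENAME version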
Jan-Pieter> For starters, the first thing we had to do on the 8200s was to create the data aggrs.
Jan-Pieter> That only worked on one of the nodes. On the other node the command failed with a timeout, leaving the cluster thinking there was no aggregate, while the aggregate was eventually created but was only visible in the "node shell", via 7-mode commands like "node run -node NODENAME aggr status". The timeout was likely caused by the fact that several additional drives needed to be partitioned to create the aggr (which is very neat - you really only lose the minimum possible space with that setup).
Jan-Pieter> Support pouted a bit at that configuration and didn't come up with a solution, so we fixed it ourselves, wiping the faulty aggr by first taking it offline in 7-mode (node run -node NODENAME aggr offline FAULTYAGGR), and then running "aggr remove-stale-record" in diag mode. There's unfortunately no way to import an aggr in cDOT, not even an empty one (that I know of).
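For reference, that cleanup boils down to something like the following (FAULTYAGGR and the node name are placeholders, and the exact parameters of the diag-level remove-stale-record command may differ; diag mode at your own risk):

    ::> node run -node fas8200-01 aggr offline FAULTYAGGR
    ::> set -privilege diagnostic
    ::*> aggr remove-stale-record -aggregate FAULTYAGGR -nodename fas8200-01
    ::*> set -privilege admin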
Jan-Pieter> Then we simply tried again to create the aggr, but this time connected to the console of the node where the aggr had to be made (one of the 8200s). That node is in fact running the new version (9.2P1), even though the cluster-wide "version" still shows 9.1. This time, it worked.
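For anyone repeating this, the aggregate creation itself is nothing special; roughly (aggregate/node names and disk count made up):

    ::> storage aggregate create -aggregate aggr1_8200a -node fas8200-01 -diskcount 20
    ::> storage aggregate show -aggregate aggr1_8200a
    ::> node run -node fas8200-01 aggr status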
Jan-Pieter> Then we started "vol move". That went without much incident, except that it took quite a while (about a week). We made sure to run only one "vol move" per aggr in parallel from the 3250 nodes, so as not to overload them. The moves went faster as more volumes migrated to the 8200s.
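The moves themselves are the standard non-disruptive vol move, e.g. (vserver/volume/aggregate names made up):

    ::> volume move start -vserver svm_backup -volume vol_backup01 -destination-aggregate aggr1_8200a
    ::> volume move show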
Jan-Pieter> Halfway through, we noticed one of the new ethernet cables to the 8200 was faulty, resulting in a lot of CRC errors on the link and unreliable/slow network connections. That caused a bit of extra lag in snapmirror, but was fortunately easy to fix by swapping the cable. However, one of the snapmirror relations now complains about "CSM: Operation referred to a non-existent session.", and while experimenting we again noticed that it matters on which node you issue commands. Things seemed to work better (or at least differently) when connected to a node running the new version rather than one running the older version (what "run local version" reports is what matters).
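If you want to spot that kind of cabling trouble from the CLI, the node-shell per-port counters and the snapmirror health flags are the places to look; something along these lines (port name made up, field names from memory):

    ::> node run -node fas8200-01 ifstat e0a
    ::> snapmirror show -fields state,healthy,unhealthy-reason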
Jan-Pieter> We migrated all LIFs to the new nodes, and proceeded to remove the 3250s from the cluster. That again resulted in an error if you tried it while connected to a 3250 (probably again due to the underlying version of the node).
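Moving the LIFs is the usual routine, roughly (vserver/LIF/port/node names made up):

    ::> network interface modify -vserver svm_backup -lif data_lif1 -home-node fas8200-01 -home-port e0c
    ::> network interface migrate-all -node fas3250-01
    ::> network interface show -curr-node fas3250-01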
Jan-Pieter> Connected to an 8200 node, we were able to remove the first 3250, but the second failed with "Cannot unjoin node X because it is the last node in the cluster with an earlier version of Data ONTAP than the rest of the cluster. Upgrade the node and then try to unjoin it again.". Fortunately, there is a diag-mode option, "cluster unjoin -skip-last-low-version-node-check", and that worked. Immediately, "version" in the cluster shell reported the new version.
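For completeness, that last unjoin looks roughly like this (node name made up; the diag option is the one quoted above, so the usual diag-mode caution applies):

    ::> set -privilege diagnostic
    ::*> cluster unjoin -node fas3250-02 -skip-last-low-version-node-check
    ::*> set -privilege admin
    ::> version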
Jan-Pieter> The cluster now consists of just the two 8200s, with partitioned disks for the root aggrs, and all of the data was moved without any downtime using "vol move". The old nodes are being wiped.
Jan-Pieter> Thanks a lot for the helpful replies! A special tip of the hat to tmac, who very quickly pointed us in the right direction. That really helped a lot!
Jan-Pieter> --
Jan-Pieter> Jan-Pieter Cornet <johnpc@xs4all.nl>
Jan-Pieter> "Any sufficiently advanced incompetence is indistinguishable from malice." - Grey's Law