vfiler migrate: overview and thoughts
When I first read that there was a way to move a vFiler from one node of a NetApp cluster to another, I was excited. I was imagining something akin to VMware's vMotion, a transparent movement of services. Digging a little deeper showed that NetApp's "vfiler migrate" functionality isn't nearly as automagic as I'd hoped.
Here are some observations.
* Disk ownership of the resources must be software-based, whether your filer is using actual disks or array LUNs (on a V-Series filer like our 3170). We had some concerns that the feature might not work well with array LUNs, but it appears that Data ONTAP doesn't distinguish between an array LUN and an actual disk in this context.
* The vfiler migrate command effectively moves complete aggregates from one filer head to another. This means that all volumes on the aggregate(s) involved must be tied only to the vFiler being moved, with no LUNs, exports or shares presented from the context of the root filer or any other vFiler (in our environment we already had a standard of creating separate aggregates for each vFiler, so this wasn't a problem). For example, after one failed attempt to migrate a vFiler, I added a CIFS share to the root volume of the vFiler via the root filer, to gain access to the etc folder of the vFiler. I forgot to remove that share, and it broke later migration attempts for a new reason.
* We've tested the vfiler migrate command dozens of times now on three different vFilers, in preparation for the migration of a production vFiler later this week. Two of those vFilers have migrated flawlessly every time, and one seems to fail about 30% of the time for various reasons which we can sometimes identify and sometimes not.
* Reasons for failure include:
  - A CIFS share from the root filer head to the vFiler's root volume. My bad.
  - Possible FC noise between the root filer and the SAN behind it.
  - Possible SCSI reservation issues between the root filer and the SAN.
  - Invalid credentials (fat-fingered a password, I think) for the "source" remote root filer. Oddly, the migrate command still stopped the vFiler, offlined its volumes and aggregates, and removed the vFiler from the source root filer before the process failed.
  - Poor alignment of the planets? Bad karma?
* In general, it seems that the vfiler migrate just fails sometimes. In every failure, however, recovery has been straightforward. The "vfiler create <vFiler name> -r <path to vFiler root volume>" command recovers the vFiler every time, albeit without a proper network configuration. The vFiler comes back up, but its virtual NIC has no subnet mask and no assignment to a physical interface. Makes sense, I guess, as the migrate command never got to the part where it normally asks what mask and interface to use. The mask and interface need to be reassigned, either from the CLI using ifconfig or with the "manage vFiler" wizard in FilerView (note that the wizard will also overwrite etc/exports with a default, but will save a backup first).
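Since the aggregate-isolation requirement above bit me once, here is a rough sketch of the kind of pre-flight check I have in mind. It is plain Python over a hand-built inventory, not anything that talks to a filer: the `Volume` and `Share` records, the `migration_blockers` function and the "rootfiler" label are all names I made up for illustration, and you would have to fill in the inventory yourself from whatever your filers report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str
    aggregate: str
    owner: str  # vFiler that owns the volume (hypothetical label, e.g. "rootfiler")


@dataclass(frozen=True)
class Share:
    name: str
    volume: str
    context: str  # vFiler or root filer that presents the share, export or LUN


def migration_blockers(vfiler, volumes, shares):
    """List reasons the aggregates behind `vfiler` are not dedicated to it."""
    # Aggregates that would move along with the vFiler.
    aggregates = {v.aggregate for v in volumes if v.owner == vfiler}
    # Every volume living on those aggregates, whoever owns it.
    affected = [v for v in volumes if v.aggregate in aggregates]
    affected_names = {v.name for v in affected}

    problems = []
    for v in affected:
        if v.owner != vfiler:
            problems.append(
                f"volume {v.name} on aggregate {v.aggregate} belongs to {v.owner}"
            )
    for s in shares:
        if s.volume in affected_names and s.context != vfiler:
            problems.append(
                f"{s.name} on volume {s.volume} is presented from {s.context}"
            )
    return problems


if __name__ == "__main__":
    # Made-up inventory reproducing the forgotten CIFS share described above.
    volumes = [
        Volume("vf1_root", "aggr_vf1", "vf1"),
        Volume("vf1_data", "aggr_vf1", "vf1"),
        Volume("vol0", "aggr0", "rootfiler"),
    ]
    shares = [
        Share("data", "vf1_data", "vf1"),
        Share("vf1_etc", "vf1_root", "rootfiler"),
    ]
    for problem in migration_blockers("vf1", volumes, shares):
        print(problem)
```

With that sample inventory, the only thing flagged is the share on the vFiler's root volume that was created from the root filer. An empty result only means the inventory you typed in looks clean. It says nothing about the FC, SCSI reservation or credential problems listed above.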
Given that there are arguably far more shops out there running HA VMware clusters than there are running HA NetApp clusters, it's probably not fair to expect vfiler migrate to be as slick, or even as well understood and documented in the wild, as vMotion. Also, inherent limitations in things like the CIFS protocol mean that some services will have to be interrupted. But overall I'd claim that the feature is useful, and that even when it doesn't work as hoped, recovery is straightforward and reliable. We're planning on proceeding with our production move later this week.
Hope this helps anyone in the same situation,
Randy