vfiler migrate: overview and thoughts
When I first read that there was a way to move a vFiler from one node of a
NetApp cluster to another I was excited. I was imagining something akin to
VMware's vMotion, a transparent movement of services. Digging a little
deeper showed that NetApp's "vfiler migrate" functionality isn't nearly as
automagic as I'd hoped.
Here are some observations.
* Disk ownership of the resources must be software-based, whether your
filer is using actual disks or array LUNs (on a V-Series filer like our
3170). We had some concerns that the feature might not work well with
array LUNs, but it appears that Data ONTAP doesn't distinguish between an
array LUN and an actual disk in this context.
* The vfiler migrate command effectively moves complete aggregates from
one filer head to another. This means that all volumes on the aggregate(s)
involved must be tied only to the vFiler being moved, with no LUNs,
exports or shares presented from the context of the root filer or any
other vFiler (in our environment we already had a standard of creating
separate aggregates for each vFiler, so this wasn't a problem). For
example, after one failed attempt to migrate a vFiler, I added a CIFS
share to the vFiler's root volume via the root filer to gain access
to the vFiler's etc folder. I forgot to remove that share, and it broke
later migration attempts for a new reason.
* We've tested the vfiler migrate command dozens of times now on three
different vFilers, in preparation for the migration of a production vFiler
later this week. Two of those vFilers have migrated flawlessly every time,
and one seems to fail about 30% of the time for various reasons which we
can sometimes identify and sometimes not.
* Reasons for failure include:
- A CIFS share from the root filer head to the vFiler's root
volume. My bad.
- Possible FC noise between the root filer and the SAN behind it.
- Possible SCSI reservation issues between the root filer and the
SAN.
- Invalid credentials (fat-fingered a password, I think) for the
"source" remote root filer. Oddly, the migrate command still stopped the
vFiler, offlined its volumes and aggregates, and removed the vFiler from
the source root filer before the process failed.
- Poor alignment of the planets? Bad karma?
* In general, it seems that vfiler migrate just fails sometimes. In
every failure, however, recovery has been straightforward. The "vfiler
create <vFiler name> -r <path to vFiler root volume>" command recovers the
vFiler every time, albeit without a proper network configuration. The
vFiler comes back up, but its virtual NIC has no subnet mask and no
assignment to a physical interface. Makes sense, I guess, as the migrate
command never got to the part where it normally asks what mask and
interface to use. Both need to be reassigned, either from the CLI using
ifconfig or with the "manage vFiler" wizard in FilerView (note that the
wizard will also overwrite etc/exports with a default, but will save a
backup first).
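To make the aggregate rule and the recovery step above a bit more concrete,
here is a rough Python sketch of the kind of pre-flight check I wish I'd
had. It is not a NetApp tool and doesn't talk to a filer: the Presentation
record, the inventory list and the example names (vf1, vf1_root and so on)
are all made up for illustration, and you would have to fill the inventory
in yourself from whatever your filers report. The only thing taken from
real life is the shape of the recovery command quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Presentation:
    """One share, export or LUN mapping (hypothetical inventory record)."""
    kind: str    # e.g. "cifs", "nfs" or "lun"
    owner: str   # context presenting it, e.g. the root filer or a vFiler
    volume: str  # name of the volume being presented


def blocking_presentations(vfiler, volumes_on_aggrs, presentations):
    """Return every presentation that touches a volume on the aggregates
    being migrated but is owned by something other than the vFiler being
    moved. Any hit here is a candidate for breaking the migration."""
    volumes = set(volumes_on_aggrs)
    return [p for p in presentations
            if p.volume in volumes and p.owner != vfiler]


def recovery_command(vfiler, root_volume_path):
    """Format the recovery command exactly as quoted in this post."""
    return f"vfiler create {vfiler} -r {root_volume_path}"


if __name__ == "__main__":
    # Made-up inventory: the first entry mimics my forgotten CIFS share.
    inventory = [
        Presentation("cifs", "vfiler0", "vf1_root"),
        Presentation("nfs", "vf1", "vf1_data"),
    ]
    for hit in blocking_presentations("vf1", ["vf1_root", "vf1_data"], inventory):
        print("remove before migrating:", hit)
    print(recovery_command("vf1", "/vol/vf1_root"))
```

An empty result from blocking_presentations doesn't guarantee a clean
migrate (see the planets and karma item above), but it would have caught
my forgotten share.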
Given that there are arguably far more shops out there running HA
VMware clusters than there are running HA NetApp clusters, it's probably
not fair to expect that vfiler migrate is going to be as slick or even as
well understood/documented in the wild as vMotion. Also, inherent
limitations in things like the CIFS protocol mean that some services
will have to be interrupted. But overall I'd claim that the feature is
useful, and even when it doesn't work as hoped, recovery is
straightforward and reliable. We're planning on proceeding with our
production move later this week.
Hope this helps anyone in the same situation,
Randy