well, that was an interesting weekend.
we dump/restore upgraded an F330 to a mixed-disc configuration. i
thought i'd share some of the high points in case anyone else is
thinking of doing this. i'd add that i'm typing this at 0300 on
monday, so excuse the typos. this is also quite long, so feel free to
ignore completely. if you just want to find my recommendations, look
for digits in column one.

the F330 had three shelves of 4Gb FSCSI discs. we wanted to upgrade
to two shelves of 4Gb FSCSI discs and one shelf of 9Gb FWSCSI discs.
ONTAP 4.2 is a prerequisite for this, so we'd been running stably at
4.2 for about ten days prior to the weekend (thanks, ja mah). our
vendor (OSS) was lending us an F330 with two shelves of 9Gb FWSCSIs to
dump my toaster to, as i'd prefer not to get 98% of the way through a
tape restore and have i/o errors (thanks, OSS).

design note: first impression of the eurologics shelves. these things
came from the hans giger school of weird and creepy alien design.
also, they look substantially less well-made than the old storageworks
shelves. i'm putting that down to the lack of a guide bar ensuring that
all the discs are precisely lined up. not that it matters. i hope.

having installed the loaner toaster, the question of how to dump arose.
i'd already borrowed an Ultra-2 from another project, and set it up as
administration host for both local and loaner toasters. we'd heard that
good things arose from fddi - dump times very significantly faster than
even 100bT - so we tried using fddi/cddi on all three boxes connected to
our cddi/fddi concentrator. the dump failed after about ninety
minutes. just hung. i presumed that this was due to the cddi drivers
being somewhat old, and decided to punt to 100bT, connecting all machines
to our main switch (catalyst 5000). the dump hung after about ninety
minutes. i presumed that this was an old dump bug biting us and
down-rev'ed the local toaster to 4.0.1cD10 (which it had been running
stably for some while before this) and restarted the dump over 100bT. the
dump hung after about ninety minutes. i presumed i was being stupid and
needed sleep and went home.

in the light of day, some console messages gave us a clue to the truth: our
cisco router was lying about the ethernet address of the administration
host, and the local toaster was losing touch with the admin host. although
this is a feature of our local network topology, i think recommendation one
has to be

1) use a private network. i don't care how good your main networking kit
is, you can do without the other packets.

we put them onto a private fddi/cddi concentrator and started the dump.
whilst it didn't hang, i wasn't getting stellar dump times, so we aborted
the dump, and moved everything onto a private 100bT switch. dump times
were much better, which i put down to the fddi drivers on the Sun being
somewhat crufty, whilst the FE drivers are cutting-edge. recommendation
two:

2) doesn't matter how great fddi is if the drivers aren't stellar. FE may
well be better.

if anyone's curious, the dump syntax was brutally simple:

  admin# rsh local "dump 0f - /" | rsh loaner "restore rfD - /restore"

dumps finished about midnight (after an excellent sushi dinner for the unix
and networking crew. thanks, shogun 9.). it took fifteen hours to dump
70Gb over FE, with fairly aggressive writing on the destination toaster
(continuous writes and CPU at 70-80%). which leads me to

3) the netapp white paper on migration suggests doing a level 0 to your
destination machine in work time a day or so beforehand, and following up
with a level 1 in downtime. i really wish i'd done that instead of doing
my level zero in (expensive) downtime.
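
for what it's worth, here is a rough sketch of how that two-pass approach
might look, reusing the host names ("local", "loaner") and the /restore
target from the command above. two things here are assumptions rather than
anything we actually ran: the "u" flag (the conventional dump option for
recording the dump date so that a later incremental knows what changed),
and the level 1 pass itself. the function only prints the pipelines, so
nothing touches either toaster until you paste a line in yourself.

```shell
#!/bin/sh
# sketch of a staged migration: level 0 in work time, level 1 in downtime.
# "local", "loaner" and /restore come from the command quoted in the post;
# the "u" flag and the level 1 pass are ASSUMPTIONS, not something we ran.

# print (rather than run) the dump/restore pipeline for a given dump level
migration_cmd() {
    level=$1
    printf 'rsh local "dump %suf - /" | rsh loaner "restore rfD - /restore"\n' "$level"
}

migration_cmd 0   # a day or so beforehand, while users are still working
migration_cmd 1   # in downtime: only what changed since the level 0
```

the point of the split is that the slow pass happens while everyone is
still working, and only the (hopefully much smaller) level 1 has to fit
into the downtime window. check the dump and restore options against your
own ONTAP release before trusting any of this.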
i then zeroed out all the disc labels on my toaster (floppy boot,
option 5) and pulled off the third shelf. removed slot 9 SCSI card,
replaced it with FWSCSI card, tried to boot. F330 hangs in "probing
devices" phase of boot. thanks to some excellent late-night phone support
by tony liu (MANY thanks, tony) we learned that the FWSCSI card requires
firmware 1.6, although ONTAP 4.2 itself doesn't. i'd already asked my vendor that
question directly and got a different answer (grumble).

emergency firmware upgrade to 1.6 ensues. toaster now boots, so i can get
on with making a new file system (floppy boot, option 4). you've never
known true nervousness until you watch all your users' home directories
being newfs'ed.

we've been told that the correct way to get a 9Gb and a 4Gb hot spare is to
allow the newfs to allocate a 9Gb hot spare, and then RAID SWAP a 4Gb disc
into a slot left unoccupied in shelf 0. when we try the RAID SWAP, the bus
fails to reset properly, and the damn thing goes into a variety of raid
panic situations. all discs in the shelf on slot 0 are also showing amber
"failure" lights, although they appear to work fine. lengthy calls to
netapp fail to rectify this problem. by now the restore won't finish
until after 0001 monday, which is closer to the promised 0700
"everything back" time than i'd like, so i decided to go with a system
which seems to handle the hot spare disc correctly (in the event of a
disc failure, it RAID reconstructs to the 9Gb hot spare) but can't do
hot swap. it's no worse than my old 450s, though i hope, as does tony,
that netapp will be able to rectify this later in the week. the
precise positioning of the
thin gray ribbon cable that used to go to the narrow SCSI card in slot
9 seems to be an important unknown, since i understand this cable is
responsible for carrying raid swap information around.

in the event, the restore finishes around 0130 monday, and the servers
(mostly) restart happily.

summary: any migration involving a dump-and-restore is highly traumatic.
we were right to avoid this when we went to the F330s, and we'd be well
advised to avoid it in any future migrations - data copy to a new toaster
is tolerable for our datasets, but copy-and-back isn't viable any more.
the mixed-shelf configuration isn't exactly bug-free yet. sleep is a much
under-rated phenomenon.

Tom Yates - Unix Chap - The Mathworks, Inc. - +1 (508) 647 7561
MAG#65061 DoD#0135 AMA#461546
1024/CFDFDE39 0C E7 46 60 BB 96 87 05 04 BD FB F8 BB 20 C1 8C