well, that was an interesting weekend.
we dump/restore upgraded an F330 to a mixed-disc configuration. i thought i'd share some of the high points in case anyone else is thinking of doing this. i'd add that i'm typing this at 0300 on monday, so excuse the typos. this is also quite long, so feel free to ignore completely. if you just want to find my recommendations, look for digits in column one.
that F330 had three shelves of 4Gb FSCSI discs. we wanted to upgrade to two shelves of 4Gb FSCSI discs and one shelf of 9Gb FWSCSI discs.
ONTAP 4.2 is a prerequisite for this, so we'd been running stably at 4.2 for about ten days prior to the weekend (thanks, ja mah). our vendor (OSS) was lending us an F330 with two shelves of 9Gb FWSCSIs to dump my toaster to, as i'd prefer not to get 98% of the way through a tape restore and have i/o errors (thanks, OSS).
design note: first impression on the eurologics shelves. these things came from the hans giger school of weird and creepy alien design. also, they look substantially less well-made than the old storageworks shelves. i'm putting that down to the lack of a guide bar ensuring that all the discs are precisely lined up. not that it matters. i hope.
having installed the loaner toaster, the question of how to dump arose. i'd already borrowed an Ultra-2 from another project, and set it up as administration host for both local and loaner toasters. we'd heard that good things arose from fddi - dump times very significantly faster than even 100bT - so we tried using fddi/cddi on all three boxes connected to our cddi/fddi concentrator. the dump failed after about ninety minutes. just hung. i presumed that this was due to the cddi drivers being somewhat old, and decided to punt to 100bT, connecting all machines to our main switch (catalyst 5000). the dump hung after about ninety minutes. i presumed that this was an old dump bug biting us and down-rev'ed the local toaster to 4.0.1cD10 (which it had been running stably for some while before this) and restarted the dump over 100bT. the dump hung after about ninety minutes. i presumed i was being stupid and needed sleep and went home.
in the light of day, some console messages gave us a clue to the truth: our cisco router was lying about the ethernet address of the administration host, and the local toaster was losing touch with the admin host. although this is a feature of our local network topology, i think recommendation one has to be
1) use a private network. i don't care how good your main networking kit is, you can do without the other packets.
we put them onto a private fddi/cddi concentrator and started the dump. whilst it didn't hang, i wasn't getting stellar dump times, so we aborted the dump, and moved everything onto a private 100bT switch. dump times were much better, which i put down to the fddi drivers on the Sun being somewhat crufty, whilst the FE drivers are cutting-edge. recommendation two:
2) doesn't matter how great fddi is if the drivers aren't stellar. FE may well be better.
if anyone's curious, the dump syntax was brutally simple:
    admin# rsh local "dump 0f - /" | rsh loaner "restore rfD - /restore"
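for anyone wanting to compare transports the way we did (fddi versus FE), here is a minimal sketch of running the same two-command pipe from the admin host while counting bytes and elapsed time. it is an illustration only, not what we ran: the function name `timed_pipe` and its parameters are made up for this example, and it assumes the admin host has python available. note that rsh generally doesn't hand back the remote command's exit status, so the return codes here mostly tell you whether the connection itself died.

```python
import subprocess
import sys
import time


def timed_pipe(producer_cmd, consumer_cmd, chunk_size=64 * 1024):
    """Run `producer | consumer`, relaying the data ourselves so we can count it.

    Returns (total_bytes, elapsed_seconds, producer_rc, consumer_rc).
    Hypothetical helper for illustration; no stall detection is attempted.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=subprocess.PIPE)
    total_bytes = 0
    start = time.monotonic()
    while True:
        chunk = producer.stdout.read(chunk_size)
        if not chunk:  # empty read means the producer closed its end
            break
        consumer.stdin.write(chunk)
        total_bytes += len(chunk)
    consumer.stdin.close()  # signal end-of-stream to the consumer
    producer.stdout.close()
    producer_rc = producer.wait()
    consumer_rc = consumer.wait()
    elapsed = time.monotonic() - start
    return total_bytes, elapsed, producer_rc, consumer_rc


# The pipe from the post would be passed in like this ("local" and "loaner"
# are the two toasters' hostnames as used in the original command):
# timed_pipe(["rsh", "local", "dump 0f - /"],
#            ["rsh", "loaner", "restore rfD - /restore"])
```

dividing the byte count by the elapsed time after a short trial run would give a rough rate for each network before committing to a fifteen-hour dump.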
dumps finished about midnight (after an excellent sushi dinner for the unix and networking crew. thanks, shogun 9.). it took fifteen hours to dump 70Gb over FE, with fairly aggressive writing on the destination toaster (continuous writes and CPU at 70-80%). which leads me to
3) the netapp white paper on migration suggests doing a level 0 to your destination machine in work time a day or so beforehand, and following up with a level 1 in downtime. i really wish i'd done that instead of doing my level zero in (expensive) downtime.
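to put numbers on that, here is a back-of-envelope sketch of the rate we saw and of what a level 1 might have cost in downtime. it assumes "70Gb" means 70 gigabytes at 1024 megabytes each, and the 5% changed-data figure is purely hypothetical (nobody measured it); a real level 1 also carries per-file overhead, so treat the result as a rough lower bound.

```python
def throughput_mb_per_s(gigabytes, hours):
    """Average transfer rate in megabytes per second."""
    return gigabytes * 1024 / (hours * 3600)


def transfer_hours(gigabytes, mb_per_s):
    """Hours needed to move `gigabytes` at a steady `mb_per_s`."""
    return gigabytes * 1024 / mb_per_s / 3600


# Figures from the post: 70 gigabytes in fifteen hours over FE.
rate = throughput_mb_per_s(70, 15)  # roughly 1.33 MB/s

# ASSUMPTION: 5% of the data changes between the level 0 and the level 1.
changed_fraction = 0.05
level1_hours = transfer_hours(70 * changed_fraction, rate)  # 0.75 hours

print(f"level 0 rate: {rate:.2f} MB/s")
print(f"estimated level 1 in downtime: {level1_hours:.2f} hours")
```

under those assumptions the downtime part of the copy shrinks from fifteen hours to under one, which is the whole argument for doing the level 0 during work time.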
i then zeroed out all the disc labels on my toaster (floppy boot, option 5) and pulled off the third shelf. removed slot 9 SCSI card, replaced it with FWSCSI card, tried to boot. F330 hangs in "probing devices" phase of boot. thanks to some excellent late-night phone support by tony liu (MANY thanks, tony) it turns out that the FWSCSI card requires firmware 1.6, although ONTAP 4.2 doesn't. i'd already asked my vendor that question directly and got a different answer (grumble).
emergency firmware upgrade to 1.6 ensues. toaster now boots, so i can get on with making a new file system (floppy boot, option 4). you've never known true nervousness until you watch all your users' home directories being newfs'ed.
we've been told that the correct way to get a 9Gb and a 4Gb hot spare is to allow the newfs to allocate a 9Gb hot spare, and then RAID SWAP in a 4Gb into a slot left unoccupied in shelf 0. when we try the RAID SWAP, the bus fails to reset properly, and the damn thing goes into a variety of raid panic situations. all discs in the shelf on slot 0 are also showing amber "failure" lights, although they appear to work fine. lengthy calls to netapp fail to rectify this problem, and it's getting to the point where the restore won't finish until after 0001 monday, which is getting closer to the promised 0700 "everything back" time than i'd like, so i decided to go with a system which seems to handle the hot spare disc correctly - in the event of a disc failure, it RAID reconstructs to the 9Gb hot spare - but can't do hot swap. it's no worse than my old 450s, though i hope netapp will be able to rectify this later in the week, as does tony. the precise positioning of the thin gray ribbon cable that used to go to the narrow SCSI card in slot 9 seems to be an important unknown, since i understand this cable is responsible for carrying raid swap information around.
in the event, the restore finishes around 0130 monday, and the servers (mostly) restart happily.
summary: any migration involving a dump-and-restore is highly traumatic. we were right to avoid this when we went to the F330s, and we'd be well advised to avoid it in any future migrations - data copy to a new toaster is tolerable for our datasets, but copy-and-back isn't viable any more. the mixed-shelf configuration isn't exactly bug-free yet. sleep is a much under-rated phenomenon.
Tom Yates - Unix Chap - The Mathworks, Inc. - +1 (508) 647 7561 MAG#65061 DoD#0135 AMA#461546 1024/CFDFDE39 0C E7 46 60 BB 96 87 05 04 BD FB F8 BB 20 C1 8C
On Mon, 13 Oct 1997, Tom Yates wrote:
> well, that was an interesting weekend.
The sushi was excellent and the champagne at 4:30 am was, um, different.
> we've been told that the correct way to get a 9Gb and a 4Gb hot spare is to allow the newfs to allocate a 9Gb hot spare, and then RAID SWAP in a 4Gb into a slot left unoccupied in shelf 0. when we try the RAID SWAP, the bus fails to reset properly, and the damn thing goes into a variety of raid panic situations. all discs in the shelf on slot 0 are also showing amber "failure" lights, although they appear to work fine. lengthy calls to netapp fail to rectify this problem, and it's getting to the point where the restore won't finish until after 0001 monday, which is getting closer to the promised 0700 "everything back" time than i'd like, so i decided to go with a system which seems to handle the hot spare disc correctly - in the event of a disc failure, it RAID reconstructs to the 9Gb hot spare - but can't do hot swap. it's no worse than my old 450s, though i hope netapp will be able to rectify this later in the week, as does tony. the precise positioning of the thin gray ribbon cable that used to go to the narrow SCSI card in slot 9 seems to be an important unknown, since i understand this cable is responsible for carrying raid swap information around.
Other than the amber lights on shelf 0 and the questionable thin ribbon cable between scsi cards and the lost ability to do hot swapping under this mixed disk configuration, one other question comes to mind:
If a 4GB disk fails and is reconstructed on the 9GB hot spare, and the failed disk is then cold swapped for a new 4GB disk, can we fail out the 9GB disk to reconstruct onto the new 4GB disk?
t'other tired admin...
-Caroline