Well, adding the PCI FC-AL controllers that NetApp sent to replace the onboard FC-AL for our clustered F740 pair did _not_ go smoothly.
This marks two times in a row now that I've attempted to use 'cf takeover' to prevent downtime, and it has led to downtime of over an hour. The previous time, we were bitten by the uptime bug. This time, who knows.
When I tried to do a 'cf takeover' on filer A, for some reason, it decided to mark one of filer B's data disks as a hot-spare, and then it panic'ed:
subzero> Sun Nov 14 02:56:35 EST [subzero: rc]: Cluster monitor: takeover initiated by operator
Sun Nov 14 02:56:35 EST [subzero: cf_main]: Cluster monitor: UP --> TAKEOVER
Sun Nov 14 02:56:35 EST [subzero: cf_takeover]: Cluster monitor: takeover started
Sun Nov 14 02:56:48 EST [subzero: disk_admin]: Resetting all devices on ISP2100 in slot 1
Sun Nov 14 02:56:55 EST [subzero: cf_takeover]: Marking disk 1.8 as a "hot spare" disk.
Sun Nov 14 02:56:56 EST [subzero: raid_disk_admin]: One disk is missing from volume partner:cim0a, RAID group 1. A "hot spare" disk is available and the missing disk will be reconstructed on the spare disk.
PANIC: wafl_check_vbns: vbn too big on Sun Nov 14 07:56:57 1999
Cluster monitor: panic during takeover
Cluster monitor: takeover capability will be disabled on reboot
subzero came back up okay. I hoped the other filer (viking) would come back up okay and begin a RAID reconstruct, but it was convinced the filesystem was hosed:
Disk 0a.4 is reserved for "hot spare"
Disk 0a.8 is reserved for "hot spare"
Disk 0a.13 is reserved for "hot spare"
3 disks are reserved for "hot spare".
27 disks are owned by clustered failover partner.
Sun Nov 14 08:03:13 GMT [rc]: A disk is missing from one or more RAID groups. System starting in degraded mode.
disk:   RAID label 1 and 2: magic/time/gen/shutdown fsid/rgid/rgdn/total
0a.0  : RAID/942566217/1881/0 20036d6/0/3/17/    RAID/942566217/1881/0 20036d6/0/3/17/
0a.1  : RAID/942566217/1881/0 20036d6/0/4/17/    RAID/942566217/1881/0 20036d6/0/4/17/
0a.2  : RAID/942566217/1881/0 20036d6/0/1/17/    RAID/942566217/1881/0 20036d6/0/1/17/
0a.3  : RAID/942566217/1881/0 20036d6/1/3/17/    RAID/942566217/1881/0 20036d6/1/3/17/
0a.4  : SPARE/0/0/0 ffffffff/-1/-1/1/            SPARE/0/0/0 ffffffff/-1/-1/1/
0a.5  : RAID/942566217/1881/0 20036d6/0/0/17/    RAID/942566217/1881/0 20036d6/0/0/17/
0a.6  : RAID/942566217/1881/0 20036d6/0/5/17/    RAID/942566217/1881/0 20036d6/0/5/17/
0a.8  : SPARE/0/0/0 ffffffff/-1/-1/1/            SPARE/0/0/0 ffffffff/-1/-1/1/
0a.9  : RAID/942566217/1881/0 20036d6/1/5/17/    RAID/942566217/1881/0 20036d6/1/5/17/
0a.10 : RAID/942566217/1881/0 20036d6/1/2/17/    RAID/942566217/1881/0 20036d6/1/2/17/
0a.11 : RAID/942566217/1881/0 20036d6/2/4/17/    RAID/942566217/1881/0 20036d6/2/4/17/
0a.12 : RAID/942566217/1881/0 20036d6/2/2/17/    RAID/942566217/1881/0 20036d6/2/2/17/
0a.13 : SPARE/0/0/0 ffffffff/-1/-1/1/            SPARE/0/0/0 ffffffff/-1/-1/1/
0a.14 : RAID/942566217/1881/0 20036d6/0/2/17/    RAID/942566217/1881/0 20036d6/0/2/17/
0a.16 : RAID/942566217/1881/0 20036d6/2/3/17/    RAID/942566217/1881/0 20036d6/2/3/17/
0a.17 : RAID/942566217/1881/0 20036d6/2/0/17/    RAID/942566217/1881/0 20036d6/2/0/17/
0a.18 : RAID/942566217/1881/0 20036d6/2/1/17/    RAID/942566217/1881/0 20036d6/2/1/17/
0a.19 : RAID/942566216/1659/0 36d6/0/0/2/P       RAID/942566216/1659/0 36d6/0/0/2/P
0a.20 : RAID/942566216/1659/0 36d6/0/1/2/S       RAID/942566216/1659/0 36d6/0/1/2/S
0a.21 : RAID/942566217/1881/0 20036d6/1/1/17/    RAID/942566217/1881/0 20036d6/1/1/17/
0a.22 : RAID/942566217/1881/0 20036d6/1/0/17/    RAID/942566217/1881/0 20036d6/1/0/17/
1.0   : SPARE/0/0/0 ffffffff/-1/-1/1/            SPARE/0/0/0 ffffffff/-1/-1/1/
1.2   : RAID/942566451/3443/0 10022f5/0/4/6/     RAID/942566451/3443/0 10022f5/0/4/6/
1.1   : RAID/942566451/3894/0 20022f5/0/1/18/    RAID/942566451/3894/0 20022f5/0/1/18/
1.4   : RAID/942566451/3894/0 20022f5/2/4/18/    RAID/942566451/3894/0 20022f5/2/4/18/
1.5   : RAID/942566451/3894/0 20022f5/2/1/18/    RAID/942566451/3894/0 20022f5/2/1/18/
1.3   : RAID/942566451/3894/0 20022f5/2/5/18/    RAID/942566451/3894/0 20022f5/2/5/18/
1.9   : RAID/942566451/3443/0 10022f5/0/0/6/     RAID/942566451/3443/0 10022f5/0/0/6/
1.8   : RAID/942566451/3443/0 10022f5/0/5/6/     RAID/942566451/3443/0 10022f5/0/5/6/
1.6   : RAID/942566451/3894/0 20022f5/0/3/18/    RAID/942566451/3894/0 20022f5/0/3/18/
1.11  : RAID/942566451/3894/0 20022f5/1/5/18/    RAID/942566451/3894/0 20022f5/1/5/18/
1.12  : RAID/942566451/3894/0 20022f5/2/0/18/    RAID/942566451/3894/0 20022f5/2/0/18/
1.14  : RAID/942566451/3894/0 20022f5/1/4/18/    RAID/942566451/3894/0 20022f5/1/4/18/
1.16  : RAID/942566451/3894/0 20022f5/2/2/18/    RAID/942566451/3894/0 20022f5/2/2/18/
1.17  : RAID/942566451/3443/0 10022f5/0/2/6/     RAID/942566451/3443/0 10022f5/0/2/6/
1.18  : RAID/942566451/3443/0 10022f5/0/3/6/     RAID/942566451/3443/0 10022f5/0/3/6/
1.19  : RAID/942566451/3769/0 22f5/0/0/2/P       RAID/942566451/3769/0 22f5/0/0/2/P
1.20  : RAID/942566451/3769/0 22f5/0/1/2/S       RAID/942566451/3769/0 22f5/0/1/2/S
1.21  : RAID/942566451/3443/0 10022f5/0/1/6/     RAID/942566451/3443/0 10022f5/0/1/6/
1.22  : RAID/942566451/3894/0 20022f5/0/4/18/    RAID/942566451/3894/0 20022f5/0/4/18/
1.24  : RAID/942566451/3894/0 20022f5/0/0/18/    RAID/942566451/3894/0 20022f5/0/0/18/
1.25  : RAID/942566451/3894/0 20022f5/0/5/18/    RAID/942566451/3894/0 20022f5/0/5/18/
1.26  : RAID/942566451/3894/0 20022f5/0/2/18/    RAID/942566451/3894/0 20022f5/0/2/18/
1.28  : RAID/942566451/3894/0 20022f5/1/2/18/    RAID/942566451/3894/0 20022f5/1/2/18/
1.27  : RAID/942566451/3894/0 20022f5/1/1/18/    RAID/942566451/3894/0 20022f5/1/1/18/
1.29  : RAID/942566451/3894/0 20022f5/1/0/18/    RAID/942566451/3894/0 20022f5/1/0/18/
1.30  : RAID/942566451/3894/0 20022f5/1/3/18/    RAID/942566451/3894/0 20022f5/1/3/18/
1.10  : RAID/942566451/3894/0 20022f5/2/3/18/    RAID/942566451/3894/0 20022f5/2/3/18/
Unclean shutdown in degraded mode without NVRAM protection. Filesystem may be scrambled.
Please contact Network Appliance Customer Support.
(You must boot from floppy to take any further action)
Joy. I presume this is because subzero panic'd. Had subzero not panic'd, I'm guessing it would have begun a reconstruct of viking's disks (after it blew away that data disk for whatever reason).
Anyway, at this point we decided to put in the PCI FC-AL, hoping the filer would notice the disks on reboot (I hadn't yet diagnosed why a data disk was missing). No go. So we ran wackz. Twenty minutes later, wackz found no problems. We rebooted the filer; it came up and began a reconstruct. It chose one of its other spares, not the former data disk, to reconstruct onto. (During the reconstruct, the filer logged 'recovered error' twice on another disk in the same RAID group. I'm glad it didn't decide to throw that disk, or I'd have been trying to figure out how to get that hot-spare-that-used-to-be-a-data-disk back as a data disk.)
I didn't want to risk another 'cf takeover' (especially during a reconstruct), so I just halted the other filer to add its PCI FC-AL. No incident there.
A case has been opened with NetApp.
This is getting really old.
Oh yeah, 5.2.3 on both filers. Both filers have successfully done a 'cf takeover' in the past (during testing, a few different times). Apparently, 'cf takeover' only fails when you actually _need_ it.
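For reference, here's roughly the sequence we've used when testing failover; this is from memory, so treat the exact commands and prompts as approximate for 5.2.3:

  subzero> cf status      (confirm the cluster monitor sees both heads)
  subzero> cf takeover    (subzero serves viking's volumes while viking is down)
  ... do the maintenance on viking, let it boot and wait for giveback ...
  subzero> cf giveback    (viking resumes serving its own volumes)

In theory that same sequence should have covered the controller swap with no client-visible downtime.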
On the bright side, NetApp customer support was pretty helpful. It was a multi-national effort: I was coordinating from Atlanta, the person performing the maintenance at our co-lo was in Sunnyvale, and the NetApp customer support person (Renee) was in the Netherlands.
j.