Well, adding the PCI FC-AL controllers that NetApp sent to replace the
onboard FC-AL for our clustered F740 pair did _not_ go smoothly.
This makes two times in a row now that I've attempted to use 'cf
takeover' to prevent downtime and it has led to over an hour of
downtime instead. The previous time, we were bitten by the uptime bug.
This time, who knows.
When I tried to do a 'cf takeover' on filer A (subzero), for some
reason it decided to mark one of filer B's (viking's) data disks as a
hot spare, and then it panicked:
subzero> Sun Nov 14 02:56:35 EST [subzero: rc]: Cluster monitor: takeover initiated by operator
Sun Nov 14 02:56:35 EST [subzero: cf_main]: Cluster monitor: UP --> TAKEOVER
Sun Nov 14 02:56:35 EST [subzero: cf_takeover]: Cluster monitor: takeover started
Sun Nov 14 02:56:48 EST [subzero: disk_admin]: Resetting all devices on ISP2100 in slot 1
Sun Nov 14 02:56:55 EST [subzero: cf_takeover]: Marking disk 1.8 as a "hot spare" disk.
Sun Nov 14 02:56:56 EST [subzero: raid_disk_admin]: One disk is missing from volume partner:cim0a, RAID group 1.
    A "hot spare" disk is available and the missing disk will be reconstructed on the spare disk.
PANIC: wafl_check_vbns: vbn too big on Sun Nov 14 07:56:57 1999
Cluster monitor: panic during takeover
Cluster monitor: takeover capability will be disabled on reboot
subzero came back up okay. I hoped the other filer (viking) would come
back up okay and begin a RAID reconstruct, but it was convinced the
filesystem was hosed:
Disk 0a.4 is reserved for "hot spare"
Disk 0a.8 is reserved for "hot spare"
Disk 0a.13 is reserved for "hot spare"
3 disks are reserved for "hot spare".
27 disks are owned by clustered failover partner.
Sun Nov 14 08:03:13 GMT [rc]: A disk is missing from one or more RAID groups. System starting in degraded mode.
disk: RAID label 1 and 2: magic/time/gen/shutdown fsid/rgid/rgdn/total
0a.0 : RAID/942566217/1881/0 20036d6/0/3/17/ RAID/942566217/1881/0 20036d6/0/3/17/
0a.1 : RAID/942566217/1881/0 20036d6/0/4/17/ RAID/942566217/1881/0 20036d6/0/4/17/
0a.2 : RAID/942566217/1881/0 20036d6/0/1/17/ RAID/942566217/1881/0 20036d6/0/1/17/
0a.3 : RAID/942566217/1881/0 20036d6/1/3/17/ RAID/942566217/1881/0 20036d6/1/3/17/
0a.4 : SPARE/0/0/0 ffffffff/-1/-1/1/ SPARE/0/0/0 ffffffff/-1/-1/1/
0a.5 : RAID/942566217/1881/0 20036d6/0/0/17/ RAID/942566217/1881/0 20036d6/0/0/17/
0a.6 : RAID/942566217/1881/0 20036d6/0/5/17/ RAID/942566217/1881/0 20036d6/0/5/17/
0a.8 : SPARE/0/0/0 ffffffff/-1/-1/1/ SPARE/0/0/0 ffffffff/-1/-1/1/
0a.9 : RAID/942566217/1881/0 20036d6/1/5/17/ RAID/942566217/1881/0 20036d6/1/5/17/
0a.10 : RAID/942566217/1881/0 20036d6/1/2/17/ RAID/942566217/1881/0 20036d6/1/2/17/
0a.11 : RAID/942566217/1881/0 20036d6/2/4/17/ RAID/942566217/1881/0 20036d6/2/4/17/
0a.12 : RAID/942566217/1881/0 20036d6/2/2/17/ RAID/942566217/1881/0 20036d6/2/2/17/
0a.13 : SPARE/0/0/0 ffffffff/-1/-1/1/ SPARE/0/0/0 ffffffff/-1/-1/1/
0a.14 : RAID/942566217/1881/0 20036d6/0/2/17/ RAID/942566217/1881/0 20036d6/0/2/17/
0a.16 : RAID/942566217/1881/0 20036d6/2/3/17/ RAID/942566217/1881/0 20036d6/2/3/17/
0a.17 : RAID/942566217/1881/0 20036d6/2/0/17/ RAID/942566217/1881/0 20036d6/2/0/17/
0a.18 : RAID/942566217/1881/0 20036d6/2/1/17/ RAID/942566217/1881/0 20036d6/2/1/17/
0a.19 : RAID/942566216/1659/0 36d6/0/0/2/P RAID/942566216/1659/0 36d6/0/0/2/P
0a.20 : RAID/942566216/1659/0 36d6/0/1/2/S RAID/942566216/1659/0 36d6/0/1/2/S
0a.21 : RAID/942566217/1881/0 20036d6/1/1/17/ RAID/942566217/1881/0 20036d6/1/1/17/
0a.22 : RAID/942566217/1881/0 20036d6/1/0/17/ RAID/942566217/1881/0 20036d6/1/0/17/
1.0 : SPARE/0/0/0 ffffffff/-1/-1/1/ SPARE/0/0/0 ffffffff/-1/-1/1/
1.2 : RAID/942566451/3443/0 10022f5/0/4/6/ RAID/942566451/3443/0 10022f5/0/4/6/
1.1 : RAID/942566451/3894/0 20022f5/0/1/18/ RAID/942566451/3894/0 20022f5/0/1/18/
1.4 : RAID/942566451/3894/0 20022f5/2/4/18/ RAID/942566451/3894/0 20022f5/2/4/18/
1.5 : RAID/942566451/3894/0 20022f5/2/1/18/ RAID/942566451/3894/0 20022f5/2/1/18/
1.3 : RAID/942566451/3894/0 20022f5/2/5/18/ RAID/942566451/3894/0 20022f5/2/5/18/
1.9 : RAID/942566451/3443/0 10022f5/0/0/6/ RAID/942566451/3443/0 10022f5/0/0/6/
1.8 : RAID/942566451/3443/0 10022f5/0/5/6/ RAID/942566451/3443/0 10022f5/0/5/6/
1.6 : RAID/942566451/3894/0 20022f5/0/3/18/ RAID/942566451/3894/0 20022f5/0/3/18/
1.11 : RAID/942566451/3894/0 20022f5/1/5/18/ RAID/942566451/3894/0 20022f5/1/5/18/
1.12 : RAID/942566451/3894/0 20022f5/2/0/18/ RAID/942566451/3894/0 20022f5/2/0/18/
1.14 : RAID/942566451/3894/0 20022f5/1/4/18/ RAID/942566451/3894/0 20022f5/1/4/18/
1.16 : RAID/942566451/3894/0 20022f5/2/2/18/ RAID/942566451/3894/0 20022f5/2/2/18/
1.17 : RAID/942566451/3443/0 10022f5/0/2/6/ RAID/942566451/3443/0 10022f5/0/2/6/
1.18 : RAID/942566451/3443/0 10022f5/0/3/6/ RAID/942566451/3443/0 10022f5/0/3/6/
1.19 : RAID/942566451/3769/0 22f5/0/0/2/P RAID/942566451/3769/0 22f5/0/0/2/P
1.20 : RAID/942566451/3769/0 22f5/0/1/2/S RAID/942566451/3769/0 22f5/0/1/2/S
1.21 : RAID/942566451/3443/0 10022f5/0/1/6/ RAID/942566451/3443/0 10022f5/0/1/6/
1.22 : RAID/942566451/3894/0 20022f5/0/4/18/ RAID/942566451/3894/0 20022f5/0/4/18/
1.24 : RAID/942566451/3894/0 20022f5/0/0/18/ RAID/942566451/3894/0 20022f5/0/0/18/
1.25 : RAID/942566451/3894/0 20022f5/0/5/18/ RAID/942566451/3894/0 20022f5/0/5/18/
1.26 : RAID/942566451/3894/0 20022f5/0/2/18/ RAID/942566451/3894/0 20022f5/0/2/18/
1.28 : RAID/942566451/3894/0 20022f5/1/2/18/ RAID/942566451/3894/0 20022f5/1/2/18/
1.27 : RAID/942566451/3894/0 20022f5/1/1/18/ RAID/942566451/3894/0 20022f5/1/1/18/
1.29 : RAID/942566451/3894/0 20022f5/1/0/18/ RAID/942566451/3894/0 20022f5/1/0/18/
1.30 : RAID/942566451/3894/0 20022f5/1/3/18/ RAID/942566451/3894/0 20022f5/1/3/18/
1.10 : RAID/942566451/3894/0 20022f5/2/3/18/ RAID/942566451/3894/0 20022f5/2/3/18/
Unclean shutdown in degraded mode without NVRAM protection.
Filesystem may be scrambled.
Please contact Network Appliance Customer Support.
(You must boot from floppy to take any further action)
Joy. I presume this is because subzero panicked. Had subzero not
panicked, I'm guessing it would have begun a reconstruct of viking's
disks (after it blew away that data disk for whatever reason).
Anyway, at this point we decided to put in the PCI FC-AL, hoping the
filer would notice the disks on reboot (I hadn't yet diagnosed why a
data disk was missing). No go. So we ran wackz; twenty minutes later it
found no problems. We rebooted the filer, and it came up and began a
reconstruct. It chose one of its other spares to reconstruct onto, not
the former data disk it had marked as a spare. (During the reconstruct,
the filer logged 'recovered error' twice on another disk in the same
RAID group. I'm glad it didn't decide to throw that disk out, or I
guess I would have been trying to figure out how to get that
hot-spare-that-used-to-be-a-data-disk back in as a data disk.)
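(For anyone else who ends up babysitting a reconstruct like this, the
commands below are roughly what I kept typing at viking's console to
watch it. They're from memory, so check them against the man pages on
your release before relying on them.)

viking> sysconfig -r
        (shows each RAID group, which spare the reconstruct is going
        to, and the percent complete)
viking> cf status
        (shows the current clustered failover state, so you know what
        the partner thinks is going on)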
I didn't want to risk another 'cf takeover' (especially during a
reconstruct), so I just halted the other filer to add its PCI FC-AL. No
incident there.
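(Not that it matters much in this case, since the panic had already
disabled takeover, but for a planned card swap on a clustered head the
generic sequence I'd expect looks something like the sketch below. The
commands are from memory and the physical step is paraphrased, so
don't treat this as gospel for your release.)

subzero> cf disable
         (keep the partner from attempting a takeover while this head
         is down)
subzero> halt
         (clean shutdown so NVRAM is flushed before powering off)
  ... power off, install the PCI FC-AL card, recable the loop, power
  back on and let it boot ...
subzero> sysconfig -r
         (confirm all loops and disks are visible again)
subzero> cf enable
         (re-arm clustered failover once both heads look healthy)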
A case has been opened with NetApp.
This is getting really old.
Oh yeah, both filers are running 5.2.3. Both filers have successfully
done a 'cf takeover' in the past (during testing, a few different
times). Apparently, 'cf takeover' only fails when you actually _need_
it.
On the bright side, NetApp customer support was pretty helpful. It was
a multinational effort (I was coordinating from Atlanta, the person
performing the maintenance at our co-lo was in Sunnyvale, and the
NetApp customer support person (Renee) was in the Netherlands).
j.