Note: this happened in the past. I'm just rehashing my hellish night
in hopes of getting some thoughts from this list. This might be a lot
of words.
Some background info:
The plan was to move roughly 1.5T of images from a broken Sun 7110 ZFS
pair to our newer FAS3250. I had been doing nightly rsyncs over NFS
between the two systems, which took roughly 8-9 hours to traverse the
1.5M files.
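(For context, the nightly sync was along these lines; the paths here
are made up and the exact flags are from memory:

    rsync -aH --numeric-ids /mnt/sun7110/images/ /mnt/fas3250/images/

i.e. archive mode, preserving hard links and numeric UIDs/GIDs across
the two NFS mounts.)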
"event"...basically a catastrophic situation where a huge influx of
network traffic inside of our cage caused two core switches to
simultaneously reboot themselves (trying to put it nicely, our network
architecture is "original," built through 10 years of a business that
never dedicated more than 1% to the IT budget).
Anyway, the network event caused the Sun 7110 to basically explode;
we never actually got its data back online. Luckily I had done a
--delete dry run the night before, so the new location was mostly up
to date and I had a list of files to remove.
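(That dry run looked roughly like this, with the same made-up paths
as above; -n makes rsync report what --delete would remove without
actually touching anything:

    rsync -aHn --delete /mnt/sun7110/images/ /mnt/fas3250/images/

Saving that output is the only reason I had a usable file list after
the 7110 died.)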
The 7110 isn't really what I want to talk about, though.
One of our other FAS2240s went offline as well. It turns out the
controllers were not redundantly connected to the network, apparently
because we didn't have enough fiber ports in the switch when it was
installed. Awesome.
The 2240's management IP and service IPs never came back online. The
switch reported sending packets to the device, but the filer never
replied. I should have had the NOC physically pull the cables and
reseat them, or move them to different ports, but this was about 2AM
and I wasn't thinking clearly.
The SP was working fine. I could get on the console of the 2240, and it
acted like there was no problem. Its partner never thought anything was
wrong either, so there was no failover.
I was going to force a takeover from its partner, but the warning
that a forced takeover could result in data corruption scared me off.
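(For anyone wondering, on 7-Mode the command in question would be
something like

    partner> cf forcetakeover

which, as I understand it, is the variant that prints the warning
about possible data corruption if the "down" node turns out to still
be alive and serving data.)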
At this point, I'm kind of freaking out. I had just moved, so of
course my landline wasn't set up yet, I had no cell signal to call
Support, and the support site was down (great luck).
I figure my best shot at getting the filer back is to reboot the node
that still thinks it's primary. I type reboot into the console through
the SP and hope for the best. The console goes dark...FOR 45 MINUTES.
After 45 minutes, it finally comes back, and all is well. I ALMOST
forced a poweroff through the SP...good thing I didn't do that!
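(For reference, what I did from the SP was roughly this; prompts and
exact syntax are from memory, so treat them as approximate:

    SP filer1> system console    <- drop to the serial console
    filer1> reboot               <- where it went dark for 45 minutes

The command I almost ran instead was "system power off" at the SP
prompt, which would have been a hard power cut rather than a clean
shutdown.)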
Why did it take 45 minutes to reboot?? Was it flushing cache to disk? I
got really scared thinking this filer wasn't going to come back.
Thoughts? Sorry for the long-winded post.
-Phil
--
_____________________
Phil Gardner
PGP Key ID 0xFECC890C
OTR Fingerprint 6707E9B8 BD6062D3 5010FE8B 36D614E3 D2F80538