Note - this happened in the past. I'm just rehashing my hellish night for some thoughts from this list. This might be a lot of words.
Some background info:
The plan was to move roughly 1.5T of images from a broken Sun 7110 ZFS pair to our newer FAS3250. I had been doing nightly rsyncs over NFS between the two systems, which took roughly 8-9 hours to traverse the 1.5M files. The night I was going to execute the move, we had a network "event"...basically a catastrophic situation where a huge influx of traffic inside our cage caused two core switches to reboot themselves simultaneously. (Trying to put it nicely, our network architecture is "original," built through 10 years of a business that never dedicated more than 1% of its budget to IT.)
Anyway, the network event caused the Sun 7110 to basically explode; we never actually got the data back online. Luckily I had done a --delete dry run the night before, so the new location was mostly up to date and I had a list of files to remove. The 7110 isn't what I want to talk about, though.
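For anyone curious, the dry run was nothing fancy. Paths here are made up, but the idea was a no-op pass that reports what --delete would remove from the destination:

    # -n (--dry-run) reports changes without touching anything;
    # --delete lists files still on the FAS3250 that are gone from the 7110
    rsync -avn --delete /mnt/sun7110/images/ /mnt/fas3250/images/ > dryrun.log

The "deleting ..." lines in that log became my cleanup list once the 7110 died.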
One of our other FAS2240s went offline as well. I came to find out that the controllers were not redundantly connected to the network, apparently because we didn't have enough fiber ports in the switch at the time it was installed. Awesome.
The 2240's management IP and service IPs never came back online. The switch reported sending packets to the device, but the filer never replied. I should have had the NOC physically pull the cables and reseat them, or move them to different ports, but this was about 2AM and I wasn't thinking clearly.
The SP was working fine. I could get on the console of the 2240, and it acted like there was no problem. Its partner never thought anything was wrong either, so there was no failover.
I was going to force a takeover from its partner, but the warning that a forced takeover could result in data corruption scared me off.
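For the record, this pair was on 7-Mode. If I'm remembering the syntax right, the command I backed away from was something like:

    partner> cf takeover -f

where -f skips the partner checks and is what triggers that data-corruption warning.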
At this point, I'm kind of freaking out. I had just moved, so of course my landline wasn't set up yet, and I had no cell signal to call Support. And the support site wasn't working either (great luck).
I figure my best shot at getting the filer back at this point is to reboot the node that still thinks it's primary. I type reboot into the console through the SP and hope for the best. The console goes dark...FOR 45 MINUTES. After 45 minutes it finally comes back, and all is well. I ALMOST forced a poweroff through the SP...good thing I didn't do that!
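In SP terms, the difference between what I did and what I almost did (commands from memory, so treat them as approximate):

    SP> system console      (attach to the ONTAP console, then type reboot - clean shutdown)
    SP> system power off    (what I almost did - cuts power with no clean shutdown)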
Why did it take 45 minutes to reboot?? Was it flushing cache to disk? I got really scared thinking this filer wasn't going to come back.
Thoughts? Sorry for the long-winded post.
-Phil