Note - this happened in the past. I'm just rehashing my hellish night for some thoughts from this list. This might be a lot of words.
Some background info:
The plan was to move roughly 1.5T of images from a broken Sun 7110 ZFS pair to our newer FAS3250. I had been doing nightly rsyncs over NFS between the two systems, which took roughly 8-9 hours to traverse the 1.5M files. The night I was going to execute the move, we had a network "event"...basically a catastrophic situation where a huge influx of traffic inside our cage caused two core switches to reboot themselves simultaneously. (Trying to put it nicely, our network architecture is "original," built through 10 years of a business that never dedicated more than 1% of its budget to IT.)
Anyway, the network event caused the Sun 7110 to basically explode; we never actually got the data back online. Luckily I had done a --delete dry run the night before, so the new location was mostly up to date and I had a list of files to remove. The 7110 isn't what I want to talk about, though.
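For anyone curious, the dry run was nothing fancy. Paths here are made up, but the idea was a no-op pass that reports what --delete would remove from the destination:

    # -n (--dry-run) reports changes without touching anything;
    # --delete lists files still on the FAS3250 that are gone from the 7110
    rsync -avn --delete /mnt/sun7110/images/ /mnt/fas3250/images/ > dryrun.log

The "deleting ..." lines in that log became my cleanup list once the 7110 died.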
One of our other FAS2240s went offline as well. I came to find out that the controllers were not redundantly connected to the network, apparently because we didn't have enough fiber ports in the switch at the time it was installed. Awesome.
The 2240's management IP and service IPs never came back online. The switch reported sending packets to the device, but the filer never replied. I should have had the NOC physically pull the cables and reseat them, or move them to different ports, but this was about 2AM and I wasn't thinking clearly.
The SP was working fine. I could get on the console of the 2240, and it acted like there was no problem. Its partner never thought anything was wrong either, so there was no failover.
I was going to force a takeover from its partner, but the warning that a forced takeover could result in data corruption scared me off.
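For the record, this pair was on 7-Mode. If I'm remembering the syntax right, the command I backed away from was something like:

    partner> cf takeover -f

where -f skips the partner checks and is what triggers that data-corruption warning.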
At this point, I'm kind of freaking out. I had just moved, so of course my landline wasn't set up yet, and I had no cell signal to call Support. And the support site wasn't working either (great luck).
I figure my best shot at getting the filer back at this point is to reboot the node that still thinks it's primary. I type reboot into the console through the SP and hope for the best. The console goes dark...FOR 45 MINUTES. After 45 minutes it finally comes back, and all is well. I ALMOST forced a poweroff through the SP...good thing I didn't do that!
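In SP terms, the difference between what I did and what I almost did (commands from memory, so treat them as approximate):

    SP> system console      (attach to the ONTAP console, then type reboot - clean shutdown)
    SP> system power off    (what I almost did - cuts power with no clean shutdown)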
Why did it take 45 minutes to reboot?? Was it flushing cache to disk? I got really scared thinking this filer wasn't going to come back.
Thoughts? Sorry for the long-winded post.
-Phil