Greetings all. Do you ever find issues with takeover/giveback events? We've had two relatively recently which resulted in service outages.
Some months ago we had a service outage during an upgrade to 8.1.4P1 on our 3240 pair. During one of the givebacks, the controller came back without any networking. NetApp speculates the 10GE driver didn't initialize when the controller rebooted. Either way, our storage service was AWOL as a result.
And then Friday we had a real controller crash. This one occurred on an offsite HA pair of 2040s. Lately we have been chasing down a PCI error on one controller; NetApp eventually came out to replace that controller. After moving operations via 'cf takeover', it turned out that the new controller had very old SP firmware, older than is supported for the running 8.1.4P1. The tech called an engineer for guidance, but just as the engineer came on WebEx the running controller panicked, taking all services offline. (NetApp is still investigating this; we're uploading different files and working through the process.)
While we can look at each of these and consider them anomalies for different reasons, it's still very worrisome that the core availability technology has twice resulted in service outages.
Any thoughts?
Thanks,
Andrew
--
Andrew Laurence Office of Information Technology
atlauren(a)uci.edu University of California, Irvine