Greetings all. Do you ever find issues with takeover/giveback events? We've had two relatively recently which resulted in service outages.
Some months ago we had a service outage during an upgrade to 8.1.4P1 on our 3240 pair. During one of the givebacks, the controller came back without any networking. NetApp speculates that the 10GE driver didn't initialize when the controller rebooted. Either way, our storage service was AWOL as a result.
And then Friday we had a real controller crash. This one occurred on an offsite HA pair of 2040s. Lately we have been chasing down a PCI error on one controller; NetApp eventually came out to replace that controller. After moving operations via 'cf takeover', it turned out that the new controller had very old SP firmware, older than is supported for the running 8.1.4P1. The tech called an engineer for guidance, but just as the engineer came on WebEx the running controller panicked, taking all services offline. (NetApp is still investigating this; we're uploading different files and working through the process.)
While we can look at each of these and consider them anomalies for different reasons, it's still very worrisome that the core availability technology has twice resulted in service outages.
Any thoughts?
Thanks, Andrew
Based on your comments about running unsupported, downlevel SP firmware, I would start by downloading Config Advisor from the NetApp Support site and running a scan against your pairs.
________________________________________
From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of Andrew Laurence [atlauren@me.com]
Sent: Monday, July 28, 2014 6:46 PM
To: toasters@teaparty.net
Subject: Issues controller takeover/givebacks
-- Andrew Laurence Office of Information Technology atlauren@uci.edu University of California, Irvine
_______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
We've experienced a few problems with takeover/giveback events, but not for the reasons you mentioned. We've had issues due to Data ONTAP bugs and also due to controller load, which resulted in either a failed giveback or a cluster panic. Our track record for successful to/gb isn't stellar, so we're always hesitant when it has to happen, even with all of the steps in place to work around past issues.
-- Mike Garrison