Dec 25 04:19:26 na03.GoCardinal.EDU Dec 25 12:20:49 [na03:mgr.stack.string:notice]: Panic string: Uncorrectable Machine Check Error at CPU1. MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on IO Exp  
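
For anyone who wants to read the STATUS value in that panic string themselves, here is a minimal sketch of a decoder. It assumes the value follows the IA32_MCi_STATUS layout from Intel's Machine-Check Architecture documentation and that the low 16 bits use the bus/interconnect compound error format. The function names, the short field labels, and the mapping of the log's five ErrCode fields onto those bit fields are my own assumptions, not NetApp's decoder.

```python
import re

# Flag bits in the upper part of an IA32_MCi_STATUS value (assumed Intel layout).
STATUS_FLAGS = [
    (63, "Val"),     # register holds a valid error
    (62, "Over"),    # an earlier error was overwritten
    (61, "UnCor"),   # error was not corrected by hardware
    (60, "Enable"),  # error reporting was enabled
    (59, "MiscV"),   # MISC register is valid
    (58, "AddrV"),   # ADDR register is valid
    (57, "PCC"),     # processor context corrupt
]

# Short labels for the sub-fields of a bus/interconnect error code
# (bit pattern 0000 1PPT RRRR IILL). The labels are illustrative only.
PP = {0: "Src", 1: "Res", 2: "Obs", 3: "Gen"}
II = {0: "Mem", 1: "Rsvd", 2: "IO", 3: "Gen"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "Gen"}
RRRR = {0: "Gen", 1: "Rd", 2: "Wr", 3: "DRd", 4: "DWr",
        5: "IRd", 6: "Prefetch", 7: "Evict", 8: "Snoop"}


def status_from_panic(line):
    """Pull the hex STATUS value out of a panic string, or return None."""
    match = re.search(r"STATUS<0x([0-9a-fA-F]+)>", line)
    return int(match.group(1), 16) if match else None


def decode_mc_status(status):
    """Split a 64-bit machine check status value into flags and error code fields."""
    code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "mca_code": code,
        "model_specific": (status >> 16) & 0xFFFF,
    }
    # Bits 15:11 equal to 00001 mark the bus/interconnect format.
    if code >> 11 == 1:
        result["bus_error"] = {
            "participation": PP[(code >> 9) & 0x3],
            "timeout": "TO" if (code >> 8) & 1 else "NTO",
            "request": RRRR.get((code >> 4) & 0xF, "Rsvd"),
            "memory_or_io": II[(code >> 2) & 0x3],
            "level": LL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    panic = "MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,...)"
    print(decode_mc_status(status_from_panic(panic)))
```

Run against the value above, the flag list comes out as Val, UnCor, Enable, PCC, which matches what the filer printed, and the low 16 bits (0x0e0f) decode to a generic bus/interconnect error with no timeout. That only confirms the log is self-consistent; it does not identify which component is at fault.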

Thanks. Yes, the fact that NetApp is immediately willing to replace most of our HW indicates they know it's an issue with our current HW.
I'd feel better recommending this plan if they could point to the specific HW issue in the current HW and demonstrate how it's fixed in newer HW revs.
Currently those details are not public or forthcoming...


On Jan 2, 2013, at 3:02 PM, Doug Siggins <DSiggins@ma.maileig.com> wrote:

Fletcher,
What was the panic string? Did you get a core to NetApp? Sometimes support is a bit reluctant to investigate further unless you press for a real answer. After 2-3 core dumps with the same type of panic string, I start demanding a fix, whether it be hardware or software.


Here are two forum posts:

https://forums.netapp.com/thread/33616
https://forums.netapp.com/thread/35456

I had a similar issue on an older filer. It panic'd 2-3 times. Luckily, as we prepare to retire the system, the load has dropped significantly over time, and I haven't seen the NMI panic for 6+ months. I had suggested we replace the system immediately and migrate off.

I guess the rule of thumb is that if you see the panic more than once, you should definitely think about hardware replacements.

From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of Fletcher Cocquyt [fcocquyt@stanford.edu]
Sent: Wednesday, January 02, 2013 5:21 PM
To: toasters@teaparty.net Lists
Cc: netapp-users@mailman.stanford.edu
Subject: BURT 519766 panic on production 3270

Happy 2013!

One of our production 3270 heads panic'ed and rebooted at 3:30 am on Dec 25 - lump of coal?

The good news is that when our system panic'ed and rebooted, the failover performed as expected, so we had only a 2-second timeout logged on our ESXi hosts and Oracle - no downtime.

There is little public info on this issue, and NetApp is recommending options ranging from "do nothing - (it's rare and may never happen again)" to "replace motherboards and all cards".
Our 3270 clusters (we have 2 in Active:Standby mode) have been stable since we installed them in Feb 2011.  We are on 8.1GA - NetApp support says the issue is independent of ONTAP version.

Anyone else encountered this issue?
What was your action and outcome?

thanks,

Fletcher Cocquyt
Stanford University School of Medicine