Re: BURT 519766 panic on production 3270

3 Jan 2013

Hello all,
I have seen this happen at a few customer sites, and for any such error the
new NetApp policy seems to be replacing the hardware. In one instance the
system board and all PCI cards were replaced.
This was the panic string:
Uncorrectable Machine Check Error at CPU0. MC5 Error:
STATUS<0xb200001084200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen));
Root Port(0,6,0): DevStatus(Corr), CorrErr(Rcvr).  in SK process
idle_thread0 on release 8.1
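
For anyone comparing these strings across systems, the flags in parentheses can be read directly off the STATUS value. Below is a minimal sketch in Python, assuming the value follows the Intel IA32_MCi_STATUS register layout and that the low 16 bits hold a bus/interconnect MCA error code. The function name and the returned field names are made up for illustration; this is not ONTAP's own decoder.

```python
# Assumption: STATUS follows the Intel IA32_MCi_STATUS layout.
# decode_mc_status is a hypothetical helper, not part of Data ONTAP.

FLAG_BITS = {
    "Val": 63,     # register contents are valid
    "Over": 62,    # an earlier error was overwritten
    "UnCor": 61,   # error was not corrected by hardware
    "Enable": 60,  # reporting of this error was enabled
    "MiscV": 59,   # MISC register holds valid data
    "AddrV": 58,   # ADDR register holds valid data
    "PCC": 57,     # processor context may be corrupt
}


def decode_mc_status(status: int) -> dict:
    """Split a 64-bit machine check status value into its main fields."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # architectural MCA error code
    # Bus/interconnect codes have the form 0000 1PPT RRRR IILL.
    is_bus = (code >> 11) == 1
    return {
        "flags": flags,
        "mca_code": code,
        "model_code": (status >> 16) & 0xFFFF,  # model-specific error code
        "bus_interconnect": is_bus,
        "timeout": bool((code >> 8) & 1) if is_bus else None,  # T bit
    }


if __name__ == "__main__":
    print(decode_mc_status(0xB200001084200E0F))
```

Under those assumptions the value above decodes to the same Val, UnCor, Enable and PCC flags shown in the panic string, with the timeout bit clear (the "NTO" in ErrCode).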
Regards,
Unnikrishnan KP
On 3 January 2013 12:15, Steffen Knauf sknauf@chipxonio.de wrote:
...
Hi,

We're running into the same error (FAS3240):

Uncorrectable Machine Check Error at CPU3. MC5 Error:
STATUS<0xb200000080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen));
PLX PCI-E switch on Controller. Root Port(0,6,0):
SecStatus(RcvMstAbt,RcvSysErr); Br[8624](9,0,0): Status(SigSysErr),
DevStatus(Corr,NFatal,UnSup), CorrErr(AdvsNF), UCorrErr(UsReq),
FirstUCorrErr(UsReq), Hdr[0](HdrLen(1),AddrType(0),Attr(0),Tc(0),Type(0)
,Format(2)), Hdr[1]((0x70090f)), Hdr[2]((0xdf50404c)), Hdr[3]((0x1c00)).
Problem Summary:
Device Br[8624](9,0,0) reported seeing the following error(s):
"Unsupported Request (UsReq): Some aspect of a received PCI packet was
unsupported".

A NetApp engineer told us that the only working solution is to replace all
FRUs and cards.

Regards,

Steffen

*From:* toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net]
*On Behalf Of *Jayanathan, David
*Sent:* Thursday, 3 January 2013 00:55
*To:* Fletcher Cocquyt; Doug Siggins
*Cc:* netapp-users@mailman.stanford.edu; toasters@teaparty.net Lists
*Subject:* RE: BURT 519766 panic on production 3270

Interesting that you were advised to replace all HW/cards. We’ve hit this
three times in our environment that I know of, and each time I was given
the following information:

Bug Number / Title:
519766 / FAS32xx Uncorrectable Machine Check Error

Problem Summary:
The storage controller suffered an interruption of service due to an
uncorrectable machine check error. The source of the error has not been
indicated.

Recommended Solution/Workaround:
If this is the first occurrence:

Update BIOS FAS3200: 5.1.1 or later.

Update SP firmware to 1.2.3 or later.

Update Data ONTAP to 8.0.2P4 or later.

Restart the system and monitor for any repeats.

If this is the second occurrence:

Replace the motherboard.

Mark the faulty hardware for RCA under bug 519766.
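
The two-stage recommendation above can be sketched as a small check. This is only an illustration in Python: the minimum levels are the ones quoted in the bug text, but the function names, the `installed` dictionary, the version-string format, and the choice to treat any occurrence after the first like the second are all assumptions, not anything NetApp publishes.

```python
import re

# Minimum levels quoted in the bug text; the dictionary keys are hypothetical.
MINIMUMS = {"bios": "5.1.1", "sp": "1.2.3", "ontap": "8.0.2P4"}


def version_key(text: str) -> tuple:
    """Turn '8.0.2P4' into (8, 0, 2, 4); a missing part counts as 0.

    Assumes plain 'X.Y[.Z][Pn]' strings; suffixes such as 'GA' are rejected.
    """
    m = re.fullmatch(r"(\d+)\.(\d+)(?:\.(\d+))?(?:P(\d+))?", text)
    if m is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(g) if g else 0 for g in m.groups())


def recommended_actions(occurrence: int, installed: dict) -> list:
    """Return the steps from the bug text for the given occurrence count."""
    if occurrence >= 2:
        # Assumption: later repeats are handled like the second occurrence.
        return [
            "replace motherboard",
            "mark faulty hardware for RCA under bug 519766",
        ]
    actions = [
        f"update {name} to {minimum} or later"
        for name, minimum in MINIMUMS.items()
        if version_key(installed[name]) < version_key(minimum)
    ]
    actions.append("restart and monitor for repeats")
    return actions


if __name__ == "__main__":
    # Hypothetical installed levels, for illustration only.
    print(recommended_actions(1, {"bios": "5.1.0", "sp": "1.2.3", "ontap": "8.1"}))
```

Comparing versions as integer tuples avoids the string-comparison trap where "8.0.2P10" would sort before "8.0.2P4".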

I have yet to find any reference in our mail to hitting this bug on any of
our systems running 8.0.2P7 or above. Two of the three occurrences were on
the same system, so the second time we upgraded to 8.0.2P4 and performed
a motherboard swap. The other time we simply did a failback and opted out
of doing a code upgrade.

Thanks,
David

*From:* toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net]
*On Behalf Of *Fletcher Cocquyt
*Sent:* Wednesday, January 02, 2013 3:14 PM
*To:* Doug Siggins
*Cc:* netapp-users@mailman.stanford.edu; toasters@teaparty.net Lists
*Subject:* Re: BURT 519766 panic on production 3270

Dec 25 04:19:26 na03.GoCardinal.EDU Dec 25 12:20:49
[na03:mgr.stack.string:notice]: Panic string: Uncorrectable Machine Check
Error at CPU1. MC5 Error:
STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen));
PLX PCI-E switch on IO Exp
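
Since the same panic string shows up in syslog on several systems, pulling out the CPU number and STATUS value makes the occurrences easier to compare. A minimal Python sketch follows, assuming the string always has the "Uncorrectable Machine Check Error at CPUn. MCn Error: STATUS<0x...>" shape seen in this thread; `parse_panic` and its field names are hypothetical, and reading "MC5" as machine check bank 5 is an interpretation, not something stated in the log.

```python
import re

# Pattern built only from the panic strings quoted in this thread.
PANIC_RE = re.compile(
    r"Uncorrectable Machine Check Error at CPU(?P<cpu>\d+)\.\s+"
    r"MC(?P<bank>\d+) Error:\s+STATUS<(?P<status>0x[0-9a-fA-F]+)>"
)


def parse_panic(text: str):
    """Return cpu, bank and status from a panic string, or None if absent."""
    # Mail and syslog wrap the string, so collapse all whitespace first.
    flat = " ".join(text.split())
    m = PANIC_RE.search(flat)
    if m is None:
        return None
    return {
        "cpu": int(m["cpu"]),
        "bank": int(m["bank"]),
        "status": int(m["status"], 16),
    }


if __name__ == "__main__":
    sample = (
        "Panic string: Uncorrectable Machine Check\n"
        "Error at CPU1. MC5 Error:\n"
        "STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,...)"
    )
    print(parse_panic(sample))
```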

Thanks. Yes, the fact that NetApp is immediately willing to replace most of
our HW suggests they know it's an issue with our current HW.
I'd feel better recommending this plan if they could point to the specific
issue in the current HW and demonstrate how it's fixed in newer HW revs.

Currently those details are not public or forthcoming...

On Jan 2, 2013, at 3:02 PM, Doug Siggins DSiggins@ma.maileig.com wrote:

Fletcher,
What was the panic string? Did you get a core to NetApp? Sometimes support
is a bit reluctant to investigate further unless you press for a real
answer. After 2-3 core dumps with the same type of panic string, I start
demanding a fix, whether it be hardware or software.

Here are two forum posts:

https://forums.netapp.com/thread/33616
https://forums.netapp.com/thread/35456

I had a similar issue on an older filer. It panicked 2-3 times. Luckily,
as we prepare to retire the system, the load has dropped significantly
over time, and I haven't seen the NMI panic for 6+ months. I had
suggested we replace the system immediately and migrate off.

I guess the rule of thumb is that if you see the panic more than once, you
should definitely think about hardware replacements.

*From:* toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on
behalf of Fletcher Cocquyt [fcocquyt@stanford.edu]
*Sent:* Wednesday, January 02, 2013 5:21 PM
*To:* toasters@teaparty.net Lists
*Cc:* netapp-users@mailman.stanford.edu
*Subject:* BURT 519766 panic on production 3270

Happy 2013!

One of our production 3270 heads panicked and rebooted at 3:30 am on Dec 25 -
lump of coal?

The good news is that when our system panicked and rebooted, the failover
performed as expected, so we had only a 2-second timeout logged on our ESXi
hosts, Oracle - no downtime.

There is little public info on this issue, and NetApp is recommending
options ranging from "do nothing (it's rare and may never happen again)" to
"replace motherboards and all cards".
Our 3270 clusters (we have 2 in Active:Standby mode) have been stable
since we installed them in Feb 2011. We are on 8.1GA - NetApp support says
the issue is independent of ONTAP version.

Has anyone else encountered this issue?
What was your action and outcome?

thanks,

Fletcher Cocquyt
Stanford University School of Medicine

Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
