Re: BURT 519766 panic on production 3270


Fletcher Cocquyt

2 Jan 2013

11:13 p.m.

Dec 25 04:19:26 na03.GoCardinal.EDU Dec 25 12:20:49 [na03:mgr.stack.string:notice]: Panic string: Uncorrectable Machine Check Error at CPU1. MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on IO Exp

Thanks. Yes, the fact that NetApp is immediately willing to replace most of our HW indicates they know it's an issue with our current HW. I'd feel better recommending this plan if they could point to the specific HW issue in the current HW and demonstrate how it's fixed in newer HW revs. Currently those details are not public or forthcoming...

On Jan 2, 2013, at 3:02 PM, Doug Siggins DSiggins@ma.maileig.com wrote:


Fletcher, What was the panic string? Did you get a core to netapp? Sometimes support is a bit reluctant to investigate further unless you press for a real answer. After 2-3 core dumps with the same type panic string, I start demanding a fix whether it be hardware or software.

Here are two forum posts:

https://forums.netapp.com/thread/33616 https://forums.netapp.com/thread/35456

I had a similar issue on an older filer. It panic'd 2-3 times. Luckily, over time, as we prepare to retire the system, the load has dropped significantly, and I haven't seen the NMI panic for 6+ months. I had suggested we replace the system immediately and migrate off.

I guess the rule of thumb is that if you see the panic more than once, you should definitely think about hardware replacements.

From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of Fletcher Cocquyt [fcocquyt@stanford.edu] Sent: Wednesday, January 02, 2013 5:21 PM To: toasters@teaparty.net Lists Cc: netapp-users@mailman.stanford.edu Subject: BURT 519766 panic on production 3270

Happy 2013!

One of our production 3270 heads panic'ed and rebooted 3:30 am Dec 25 - lump of coal ?

The good news is, when our system panic'ed and rebooted, the failover performed as expected so we had only a 2 second timeout logged on our ESXi hosts, Oracle - no downtime.

There is scarce public info on this issue, and NetApp is recommending options ranging from "do nothing (it's rare and may never happen again)" to "replace motherboards and all cards". Our 3270 clusters (we have 2 in Active:Standby mode) have been stable since we installed them in Feb 2011. We are on 8.1GA; NetApp support says the issue is independent of the ONTAP version.

Anyone else encountered this issue? What was your action and outcome?

thanks,

Fletcher Cocquyt Stanford University School of Medicine


Jayanathan, David

2 Jan 2013

11:54 p.m.

New subject: BURT 519766 panic on production 3270

Interesting that you were advised to replace all HW/cards. We've hit this three times in our environment that I know of, and each time I was provided with the following information:

Bug Number / Title:

519766 / FAS32xx Uncorrectable Machine Check Error

Problem Summary:

The storage controller suffered an interruption of service due to an uncorrectable machine check error. The source of the error has not been indicated.

Recommended Solution/Workaround:

If this is the first occurrence:

- Update BIOS FAS3200: 5.1.1 or later.

- Update SP firmware to 1.2.3 or later.

- Update Data ONTAP to 8.0.2P4 or later.

- Restart the system and monitor for any repeats.

If this is the second occurrence:

- Replace the motherboard.

- Mark the faulty hardware for RCA under bug 519766.

I have yet to find any references in mail of hitting this bug on any of our systems running 8.0.2P7 or above. Two of the times we hit it were on the same system, so on the 2nd time we upgraded to 8.0.2P4 and performed a motherboard swap. The other time we purely did a failback and opted out of doing a code upgrade.

Thanks,

David
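The escalation steps quoted in the message above can be sketched as a small check. This is only an illustration: the helper names `parse_version` and `suggested_actions` are hypothetical, the version parser is a guess at how strings like "8.0.2P4" or "8.1GA" compare, and treating a third or later panic the same as the second is an assumption the quoted guidance does not state.

```python
import re

# Minimum levels quoted in the thread for a first occurrence of bug 519766.
MINIMUMS = {"bios": "5.1.1", "sp": "1.2.3", "ontap": "8.0.2P4"}

# Matches "5.1.1", "8.0.2P4", "8.1" and the leading part of "8.1GA".
_VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:P(\d+))?")


def parse_version(text):
    """Turn '8.0.2P4' into (8, 0, 2, 4); missing parts count as 0."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def suggested_actions(occurrence, installed):
    """Return the steps from the quoted guidance for this occurrence count."""
    if occurrence >= 2:  # assumption: later repeats are handled like the second
        return ["Replace the motherboard",
                "Mark the faulty hardware for RCA under bug 519766"]
    actions = [f"Update {name} to {minimum} or later"
               for name, minimum in MINIMUMS.items()
               if parse_version(installed[name]) < parse_version(minimum)]
    actions.append("Restart the system and monitor for any repeats")
    return actions
```

Calling `suggested_actions(1, {"bios": "5.1.0", "sp": "1.2.3", "ontap": "8.1GA"})` with those made-up BIOS and SP levels would list only the BIOS update and the restart step, since the other two components already meet the quoted minimums.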


Nilsson Marcus

3 Jan 2013

10:19 a.m.

New subject: BURT 519766 panic on production 3270

Hi, We had 3 panics in about 6 months on one of the heads in an NFS-only 3240 cluster. We were running 8.0.2P6 (7-Mode) at the time.

This was the panic string:

PANIC: Uncorrectable Machine Check Error at CPU1. MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on Controller, Qlogic FC 4G adapter on Controller, Qlogic FC 4G adapter on Controller. Root Port(0,6,0): Status(SigSysErr), SecStatus(RcvMstAbt), DevStatus(NFatal), RootErr(UCor,NFatal), ErrSrcID(CorrSrc(0),UCorrSrc(0x20)), UCorrErr(CpTim), FirstUCorrErr(CpTim), Hdr[0](HdrLen(1),AddrType(0),Attr(0),Tc(0),Type(10),Format(2)), Hdr[1]((0x11000004)), Hdr[2]((0x70184c)), Hdr[3]((0)); Br[8624](14,5,0): DevStatus(Corr,UnSup), CorrErr(RNRov,RpTim,AdvsNF); Dv[6432](16,0,0): Status(0xffff), DevStatus(0xffff), CorrErr(0xffffffff), UCorrErr(0xfffffffe), FirstUCorrErr(0xffffffff); Dv[6432](16,0,1): Status(0xffff), DevStatus(0xffff), CorrErr(0xffffffff), UCorrErr(0xfffffffe), FirstUCorrErr(0xffffffff).

NetApp replaced the mainboard and since then we have not seen any additional panics.

BR Marcus
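For anyone who wants to check the flags in the panic string by hand, here is a minimal sketch that decodes the STATUS value, assuming it is a standard x86 IA32_MCi_STATUS register as laid out in the Intel SDM. The function name `decode_mc_status` is hypothetical, and the short labels (Gen, NTO and so on) are chosen to resemble the panic output rather than taken from any NetApp documentation.

```python
# Flag bits of an IA32_MCi_STATUS register, highest bit first (Intel SDM layout).
FLAG_BITS = [(63, "Val"), (62, "Over"), (61, "UnCor"), (60, "Enable"),
             (59, "MiscV"), (58, "AddrV"), (57, "PCC")]

# Field encodings of a bus/interconnect MCA error code: 0000 1PPT RRRR IILL.
# The short labels are an assumption made to resemble the panic output.
PP = ["Src", "Res", "Obs", "Gen"]      # participation
T = ["NTO", "TO"]                      # timeout / no timeout
RRRR = ["Gen", "Rd", "Wr", "DRd", "DWr", "IRd", "Pref", "Evict", "Snoop"]
II = ["Mem", "Rsvd", "IO", "Gen"]      # memory or I/O
LL = ["L0", "L1", "L2", "Gen"]         # cache level


def decode_mc_status(status):
    """Split a 64-bit machine-check STATUS value into flags and error codes."""
    mca_code = status & 0xFFFF
    decoded = {
        "flags": [name for bit, name in FLAG_BITS if (status >> bit) & 1],
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "bus_error": None,
    }
    if (mca_code & 0xF800) == 0x0800:  # bus/interconnect error class
        request = (mca_code >> 4) & 0xF
        decoded["bus_error"] = (
            PP[(mca_code >> 9) & 0x3],
            T[(mca_code >> 8) & 0x1],
            RRRR[request] if request < len(RRRR) else f"Req{request}",
            II[(mca_code >> 2) & 0x3],
            LL[mca_code & 0x3],
        )
    return decoded
```

Under that reading, `decode_mc_status(0xb200001080200e0f)` gives the flags Val, UnCor, Enable, PCC and the bus error fields (Gen, NTO, Gen, Gen, Gen), which lines up with what the panic string prints next to the raw value. The decode only restates the register contents; it does not identify which component failed.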


Steffen Knauf

12:15 p.m.

New subject: AW: BURT 519766 panic on production 3270

hi,

we're running into the same error (FAS3240):

Uncorrectable Machine Check Error at CPU3. MC5 Error: STATUS<0xb200000080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on Controller. Root Port(0,6,0): SecStatus(RcvMstAbt,RcvSysErr); Br[8624](9,0,0): Status(SigSysErr), DevStatus(Corr,NFatal,UnSup), CorrErr(AdvsNF), UCorrErr(UsReq), FirstUCorrErr(UsReq), Hdr[0](HdrLen(1),AddrType(0),Attr(0),Tc(0),Type(0),Format(2)), Hdr[1]((0x70090f)), Hdr[2]((0xdf50404c)), Hdr[3]((0x1c00)).

Problem Summary:

Device Br[8624](9,0,0) reported seeing the following error(s): "Unsupported Request (UsReq): Some aspect of a received PCI packet was unsupported".

A NetApp engineer told us that the only working solution is to replace all FRUs/cards.

greets

Steffen
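The abbreviated error names in these panic strings can be expanded mechanically. The sketch below is a rough aid, not NetApp's own decoder: only UsReq is spelled out in the problem summary above, while the other expansions are my reading of the abbreviations against the PCIe Advanced Error Reporting register names and should be treated as assumptions. The function name `expand_aer_fields` is hypothetical.

```python
import re

# Expansions of the abbreviations seen in the thread. Only UsReq is confirmed
# by the quoted problem summary; the rest are assumptions based on the PCIe
# AER correctable/uncorrectable error status bit names.
AER_NAMES = {
    "UsReq": "Unsupported Request",
    "CpTim": "Completion Timeout",
    "RNRov": "REPLAY_NUM Rollover",
    "RpTim": "Replay Timer Timeout",
    "AdvsNF": "Advisory Non-Fatal Error",
    "Rcvr": "Receiver Error",
}

# Matches symbolic fields such as "CorrErr(RNRov,RpTim,AdvsNF)" and skips
# fields that carry a raw hex value such as "CorrErr(0xffffffff)".
_FIELD_RE = re.compile(r"\b(CorrErr|UCorrErr|FirstUCorrErr)\(([A-Za-z,]+)\)")


def expand_aer_fields(panic_string):
    """Return (register, [expanded names]) for each symbolic AER field found."""
    return [
        (register, [AER_NAMES.get(abbr, abbr) for abbr in abbrs.split(",")])
        for register, abbrs in _FIELD_RE.findall(panic_string)
    ]
```

Unknown abbreviations are passed through unchanged, and raw hex fields are deliberately left alone, since a register that reads back as all ones typically means the device did not answer the read rather than that every error bit was set.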


Unnikrishnan KP

2:12 p.m.

New subject: BURT 519766 panic on production 3270

Hello all, I have seen this happen at a few customer sites, and for any errors the new NetApp policy seems to be replacing the hardware. In one instance the system board and all PCI cards were replaced.

This was the panic string:

Uncorrectable Machine Check Error at CPU0. MC5 Error: STATUS<0xb200001084200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); Root Port(0,6,0): DevStatus(Corr), CorrErr(Rcvr). in SK process idle_thread0 on release 8.1

Regards, Unnikrishnan KP


Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters

Dan Burkland

3:22 p.m.

New subject: BURT 519766 panic on production 3270

We experienced the same issue with our NetApp 6080s running 8.0.1P4, and NetApp replaced the system boards in each.

Regards,

Dan


Fletcher Cocquyt

9 Jan 2013

6:49 a.m.

New subject: BURT 519766 panic on production 3270

Yes, we are taking the stance of requesting more information.

We'd be more confident recommending the drastic action of replacing all HW if NetApp could point to the specific HW issue(s) in the current HW and demonstrate how they are fixed in newer HW revs. Currently those details are not public or forthcoming...

Without the full picture, we fear the very considerable effort of the HW replacement plan could be wasted if the issue then recurred. We are asking NetApp to make the technical details available to us to raise confidence in the HW replacement plan.

thanks


Unnikrishnan KP

10 Jan 2013

12:06 p.m.

New subject: BURT 519766 panic on production 3270

Hello all, The two bug IDs and the information NetApp provides for them are not useful at all, to say the least:

https://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=504167

https://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=519766

Regards, Unnikrishnan KP

On 9 January 2013 06:49, Fletcher Cocquyt fcocquyt@stanford.edu wrote:

...

Yes - we are taking the stance of requesting more information

We'd be more confident recommending the drastic action of replacing all HW if Netapp could point to the specific HW issue(s) in the current HW and demonstrate how its fixed in newer HW revs Currently those details are not public/forthcoming…

Without the full picture we fear the very considerable effort of the HW replacement plan work could be wasted if the issue then re-occurred - we are asking Netapp to make the technical details available to us to raise confidence in the HW replacement plan

thanks

On Jan 2, 2013, at 3:54 PM, "Jayanathan, David" djayan@qualcomm.com wrote:

Interesting that you were advised to replace all HW/cards. We’ve hit this three times in our environment that I know of and all times I was provided with the following information:****

Bug Number / Title:**** 519766 / FAS32xx Uncorrectable Machine Check Error****

Problem Summary:**** The storage controller suffered an interruption of service due to an uncorrectable machine check error. The source of the error has not been indicated.****

Recommended Solution/Workaround:**** If this is the first occurrence:****

Update BIOS FAS3200: 5.1.1 or later.****

Update SP firmware to 1.2.3 or later.****

Update Data ONTAP to 8.0.2P4 or later.****

Restart the system and monitor for any repeats.****

If this is the second occurrence:****

Replace the motherboard.****

Mark the faulty hardware for RCA under bug 519766.****

I have yet to find any references in mail of hitting this bug on any of our systems running 8.0.2P7 or above. Two of the times we hit it were on the same system, so on the 2ndtime we upgraded to 8.0.2P4 and performed a motherboard swap. The other time we purely did a failback and opted out of doing a code upgrade.****

Thanks,**** David****

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Fletcher Cocquyt
Sent: Wednesday, January 02, 2013 3:14 PM
To: Doug Siggins
Cc: netapp-users@mailman.stanford.edu; toasters@teaparty.net Lists
Subject: Re: BURT 519766 panic on production 3270

Dec 25 04:19:26 na03.GoCardinal.EDU Dec 25 12:20:49 [na03:mgr.stack.string:notice]: Panic string: Uncorrectable Machine Check Error at CPU1. MC5 Error: STATUS<0xb200001080200e0f>(Val,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); PLX PCI-E switch on IO Exp

thanks - yes the fact Netapp is immediately willing to replace most of our HW indicates they know its an issue with our current HW
I'd feel better recommending this plan if they could point to the specific HW issue in the current HW and demonstrate how its fixed in newer HW revs

Currently those details are not public/forthcoming...

On Jan 2, 2013, at 3:02 PM, Doug Siggins DSiggins@ma.maileig.com wrote:

Fletcher,
What was the panic string? Did you get a core to netapp? Sometimes support is a bit reluctant to investigate further unless you press for a real answer. After 2-3 core dumps with the same type panic string, I start demanding a fix whether it be hardware or software.

Here are two forum posts:

https://forums.netapp.com/thread/33616
https://forums.netapp.com/thread/35456

I had a similar issue on an older filer. It panic'd 2-3 times. Luckily, over time has we prepare to retire the system the load has dropped significantly, and I haven't seen the NMI panic for 6+ months. I had suggested we replace the system immediately and migrate off.

I guess the rule of thumb is that if you see the panic more than once, you should definitely think about hardware replacements.

From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of Fletcher Cocquyt [fcocquyt@stanford.edu]
Sent: Wednesday, January 02, 2013 5:21 PM
To: toasters@teaparty.net Lists
Cc: netapp-users@mailman.stanford.edu
Subject: BURT 519766 panic on production 3270

Happy 2013!

One of our production 3270 heads panic'ed and rebooted 3:30 am Dec 25 - lump of coal ?

The good news is, when our system panic'ed and rebooted, the failover performed as expected so we had only a 2 second timeout logged on our ESXi hosts, Oracle - no downtime.

There is scarce public info on this issue and Netapp is recommending options from "do nothing - (its rare and may never happen again)" to "replace motherboards and all cards"

Our 3270 clusters (we have 2 in Active:Standby mode) have been stable since we installed them in Feb 2011. We are on 8.1GA - Netapp support says the issue is independent of OnTAP version.

Anyone else encountered this issue?
What was your action and outcome?

thanks,

Fletcher Cocquyt
Stanford University School of Medicine

Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters

Antonio Varni

8:04 p.m.

New subject: BURT 519766 panic on production 3270

It's not a lot more but there are little clues you can coax out of support.netapp.com

Maybe this is already known... you can at least get the burt title, etc. This can give you additional things to search for, so you can try to piece together as much info as possible without having internal netapp support access. Maybe you can find this elsewhere, so sorry if this adds little to the conversation:

https://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=504167

additional info:

504167: FAS32xx Uncorrectable Machine Checks
Content Type: Troubleshooting and Support
Content Sub-Type: Bug
Short Title: 504167: FAS32xx Uncorrectable Machine Checks
Last Updated Date: Fri, 16 Dec 2011 02:29:19 PST
File Type: htm
Description: Uncorrectable Machine Checks may occur on the FAS32xx platform. These require diagnosis and remediation by Support Engineers
Bug Id: 504167
Date Created: Mon, 09 May 2011 07:24:02 PDT
Keywords: FAS32xx Uncorrectable Machine Checks
Burt Title: Carnegie: PMC-Sierra SAS HBA NMI PCIe panic caused by MfTLB (Malformed TLP)
Burt Link: http://burtweb-prd.eng.netapp.com/burt/burt-bin/start?burt-id=504167
Duplicate Of: 519766.0
Burt Patch Release: -
Fixed-In Version: -

_ _ antonio varni [technology]

Estalea, L.P. 10 E. Figueroa St,.2nd Floor Santa Barbara, CA 93101 v 805.252.0115 f 805.899.2697 e avarni@estalea.com w www.estalea.com

On Thu, Jan 10, 2013 at 4:06 AM, Unnikrishnan KP krshnakp@gmail.com wrote:

...

Hello all, The two BUG ID's and their information from NetApp is not useful at all to say the least:

https://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=504167

https://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=519766

Regards, Unnikrishnan KP

Patrick Giagnocavo

8:29 p.m.

New subject: BURT 519766 panic on production 3270

I am only a newbie with NetApps; however, I have some experience with rackmount servers, as I have 2 racks' worth of them :)

A machine check exception is generated by the CPU, usually.

This Wikipedia page tells you in general what is going on: http://en.wikipedia.org/wiki/Machine_Check_Exception

So the 2nd core (CPU1, not CPU0) had a problem (in the original post on this thread).

The problem was not correctable and seems to have been on the PCI Express bus (either on the bridge chip itself, or a device connected to it).
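To make the reading of the panic string concrete, here is a minimal sketch that decodes the `STATUS<0xb200001080200e0f>` value quoted earlier in the thread. The bit positions follow the architectural machine-check status register layout and the "bus and interconnect" compound error code form (`0000 1PPT RRRR IILL`) as described in the Intel Software Developer's Manual; that this is how Data ONTAP derives its `Val,UnCor,Enable,PCC` and `ErrCode(...)` labels is an assumption, and `decode_flags` and `decode_bus_error` are hypothetical helper names.

```python
# STATUS value copied from the panic string in the original post.
STATUS = 0xB200001080200E0F

# High-order flag bits of a machine-check status register (per the Intel SDM
# layout). The short names mirror the labels in the panic string; that
# mapping is an assumption.
FLAGS = {
    63: "Val",     # register contains a valid error
    62: "Over",    # an earlier error was overwritten
    61: "UnCor",   # error was not corrected by hardware
    60: "Enable",  # error reporting was enabled
    59: "MiscV",   # MISC register holds extra information
    58: "AddrV",   # ADDR register holds a valid address
    57: "PCC",     # processor context corrupt
}


def decode_flags(status):
    """Return the names of the flag bits that are set, highest bit first."""
    return [name for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


def decode_bus_error(status):
    """Split the low 16-bit error code into bus/interconnect fields.

    Returns None if the code does not match the 0000 1PPT RRRR IILL form.
    """
    code = status & 0xFFFF
    if code >> 11 != 0b00001:
        return None
    return {
        "PP": (code >> 9) & 0b11,       # participation (0b11 = generic)
        "T": (code >> 8) & 0b1,         # timeout (0 = no timeout)
        "RRRR": (code >> 4) & 0b1111,   # request type (0 = generic error)
        "II": (code >> 2) & 0b11,       # memory or I/O (0b11 = other)
        "LL": code & 0b11,              # cache level (0b11 = generic)
    }


print(decode_flags(STATUS))
print(decode_bus_error(STATUS))
```

Under these assumptions the set flags come out as Val, UnCor, Enable and PCC, and the error code is a bus/interconnect error with every field generic and no timeout, which lines up with `ErrCode(Gen,NTO,Gen,Gen,Gen)`. In other words, the register itself says little beyond "uncorrectable error on the interconnect".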

You are not the only person to experience this (found via google): https://twitter.com/nerdicwalker/status/110360608121167873

These errors require diagnosis because the error message is not specific enough to figure out what is going on.

The only times I have seen this in my systems (non-NA) were 1) bad or slightly incompatible RAM, easily fixed, and 2) a bad motherboard, which I stopped using. So there is quite a range as to what can be going on.

Hope this helps,

Patrick

PS am looking for FAS250 or so on the cheap for testing / dev work if anyone has one.

Fletcher Cocquyt

15 Jan

4:42 a.m.

New subject: BURT 519766 panic on production 3270

We met with our Netapp team today and received the technical explanation we needed to move forward with the hardware replacement option.

As one reply already mentioned, there is a real hardware issue identified with the 32xx/62xx series, and Netapp is now working to proactively replace the parts with suspect PCM (DRAM), SAS, and IOxM chips.

Our clusters operate in active:standby mode, so we won't need downtime or risk a production failover for this fix.

thanks


Unnikrishnan KP

16 Jan

1:22 p.m.

New subject: BURT 519766 panic on production 3270

Hello all,
I have brought this up as a NetApp community thread: https://communities.netapp.com/thread/25901

I hope we can get more information from other users too.

Regards, Unnikrishnan KP



participants (8)

Antonio Varni
Dan Burkland
Fletcher Cocquyt
Jayanathan, David
Nilsson Marcus
Patrick Giagnocavo
Steffen Knauf
Unnikrishnan KP