Motherboard replacement on FAS3270 caused fabric wide issue

Momonth

2 Feb 2015

10:25 a.m.

Hi All,

I hit the following bug on one of the filers (FAS3260) I manage:

http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=659544

This filer (filer-prod-204) works in HA mode with filer-prod-203. They are connected to two redundant FC SAN fabrics (one connection from each filer per fabric). There are more HA pairs connected to the same fabrics, e.g. filer-prod-201 / filer-prod-202. All of our filers run in 'single-image' mode. We run the FC SAN fabrics in "hard zoning mode".

NetApp support's conclusion was to replace the motherboard on the filer, and we proceeded with that.

Here is the issue we had, for which I have no explanation. I hope you guys can help me with it:

Once filer-prod-204 had its motherboard replaced, was powered on, and entered HW diagnostics mode, I saw messages like the ones below *on every other filer* (e.g. filer-prod-201) connected to the same fabric, causing issues on the hosts (mainly CentOS 6.4) attached to them:

Fri Jan 30 20:07:45 CET [filer-prod-201: scsitarget.ispfct.targetReset:notice]: FCP Target 0c: Target was Reset by the Initiator at Port Id: 0x11000 (WWPN 5001438021e071ec)
Fri Jan 30 20:07:46 CET [filer-prod-201: scsitarget.ispfct.targetReset:notice]: FCP Target 0c: Target was Reset by the Initiator at Port Id: 0x10200 (WWPN 50014380186abac4)
...

Fri Jan 30 20:08:14 CET [filer-prod-201: scsitarget.ispfct.portLogin:notice]: FCP login on Fibre Channel adapter '0c' from '50:01:43:80:21:e0:71:ec', address 0x11000.
Fri Jan 30 20:08:14 CET [filer-prod-201: scsitarget.ispfct.portLogin:notice]: FCP login on Fibre Channel adapter '0c' from '50:01:43:80:18:6a:ba:c4', address 0x10200.

So every single initiator on the filers *not involved* in the maintenance was reset, then tried to log back in, was reset again, and it looped like that until I disabled filer-prod-204's target ports on the FC switches. Once filer-prod-204 booted up with ONTAP, the issue was gone. I know this because when I re-enabled filer-prod-204's target ports, I didn't see any messages like the ones above, and everything has been running fine since then.
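For anyone who wants to quantify a reset/login loop like this from a filer's messages log, here is a minimal Python sketch. It assumes the messages look exactly like the two sample formats quoted above (other ONTAP releases may word them differently); the function names and regular expressions are made up for illustration and are not a NetApp tool.

```python
import re
from collections import Counter

# Patterns derived from the two sample message formats quoted in this post.
# Assumption: the wording is identical on the filer being analysed.
RESET_RE = re.compile(
    r"\[(?P<filer>[\w-]+): scsitarget\.ispfct\.targetReset:notice\]: "
    r"FCP Target (?P<port>\w+): Target was Reset by the Initiator at Port Id: "
    r"(?P<pid>0x[0-9a-fA-F]+) \(WWPN (?P<wwpn>[0-9a-fA-F]{16})\)"
)
LOGIN_RE = re.compile(
    r"\[(?P<filer>[\w-]+): scsitarget\.ispfct\.portLogin:notice\]: "
    r"FCP login on Fibre Channel adapter '(?P<port>\w+)' from "
    r"'(?P<wwpn>(?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2})', "
    r"address (?P<pid>0x[0-9a-fA-F]+)"
)


def normalize_wwpn(wwpn):
    """Return a WWPN as 16 lowercase hex digits without colons."""
    return wwpn.replace(":", "").lower()


def count_events(lines):
    """Count target resets and port logins per initiator WWPN."""
    resets, logins = Counter(), Counter()
    for line in lines:
        # finditer also copes with several messages run together on one line.
        for m in RESET_RE.finditer(line):
            resets[normalize_wwpn(m.group("wwpn"))] += 1
        for m in LOGIN_RE.finditer(line):
            logins[normalize_wwpn(m.group("wwpn"))] += 1
    return resets, logins


sample = [
    "Fri Jan 30 20:07:45 CET [filer-prod-201: scsitarget.ispfct.targetReset:notice]: "
    "FCP Target 0c: Target was Reset by the Initiator at Port Id: 0x11000 "
    "(WWPN 5001438021e071ec)",
    "Fri Jan 30 20:08:14 CET [filer-prod-201: scsitarget.ispfct.portLogin:notice]: "
    "FCP login on Fibre Channel adapter '0c' from '50:01:43:80:21:e0:71:ec', "
    "address 0x11000.",
]
resets, logins = count_events(sample)
for wwpn in sorted(set(resets) | set(logins)):
    print(wwpn, "resets:", resets[wwpn], "logins:", logins[wwpn])
```

The two message types print the WWPN differently (plain hex vs. colon-separated), which is why both are normalized before counting.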

Does anyone have an idea what was happening here and why?

Cheers, Vladimir

Borzenkov, Andrei

2 Feb

12:03 p.m.

My best guess is that the filer ports were configured as initiators by default and somehow conflicted with the host HBAs (the filer will try to use any LUNs it finds as disks). Do you use two-port zones or fan-out zones (single initiator, multiple targets)? Note that the motherboard replacement procedure recommends disconnecting the ports until they are properly configured.

_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters

Momonth

2:10 p.m.

Due to "historical reasons" our zones are "two initiators, multiple targets". I know it's sub-optimal, but that's the way it is. Such zones have always worked with controlled failovers, ONTAP upgrades, etc.

When the NetApp technician arrived, I specifically asked him whether it would be best to disable the respective ports on the fabrics for the filer in question (as I bet I had already seen this behaviour once), but the answer was "no, it should not affect anything".

Basil

2:49 p.m.

The root problem here is that nodes which are not part of the HA pair being serviced are getting resets from Linux hosts. I'm not a low-level SCSI expert, but we once had a problem that resulted in resets being sent and causing issues, and I think I remember hearing that they affect the entire zone, meaning everything that can "see" the initiator will be told to reset.

It's nondisruptive to change from your zoning setup to a more optimal one where each zone contains a single initiator and a single target. Also, you mentioned "hard" zoning: did you mean that literally, i.e. your zones have physical port locations in them?
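To illustrate the single-initiator, single-target layout, here is a minimal Python sketch that expands a set of initiators and targets into one zone per pair. The aliases, the zone naming scheme, and the target placeholders are all made up for the example; only the two initiator WWPNs come from the log excerpt earlier in the thread.

```python
from itertools import product


def single_pair_zones(initiators, targets):
    """Return one zone per (initiator, target) pair.

    Both arguments map an alias to a zone member (WWPN or port).
    The 'z_<initiator>_<target>' naming scheme is hypothetical.
    """
    zones = {}
    for (i_alias, i_member), (t_alias, t_member) in product(
        sorted(initiators.items()), sorted(targets.items())
    ):
        zones["z_%s_%s" % (i_alias, t_alias)] = [i_member, t_member]
    return zones


initiators = {
    "host_a": "50:01:43:80:21:e0:71:ec",
    "host_b": "50:01:43:80:18:6a:ba:c4",
}
targets = {
    # Placeholders: the filers' target WWPNs are not given in this thread.
    "filer201_0c": "TARGET_WWPN_PLACEHOLDER_1",
    "filer202_0c": "TARGET_WWPN_PLACEHOLDER_2",
}
zones = single_pair_zones(initiators, targets)
for name in sorted(zones):
    print(name, zones[name])
```

With two initiators and two targets this yields four two-member zones, so no initiator shares a zone with another initiator.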

Momonth

3:12 p.m.

Also, it's worth mentioning that all the filers involved are 7-Mode filers, but on different releases.

On Mon, Feb 2, 2015 at 3:49 PM, Basil basilberntsen@gmail.com wrote:

...

Also, you mentioned "hard" zoning- did you mean that literally, like your zones have physical port locations in them?

Yes, something like that:

zone: zone_test 2,2 2,3 2,14 2,15

where the leading "2," of each pair is the switch ID and "2", "3", "14", "15" are the port IDs.
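The switch,port notation above can be unpacked mechanically. A minimal Python sketch, assuming the zone is written on a single line exactly as in this example (real switch output may wrap lines or separate members with semicolons); the function name is made up for illustration:

```python
def parse_port_zone(line):
    """Split 'zone: <name> S,P S,P ...' into the zone name and (switch, port) pairs."""
    # Treat semicolons as plain separators in case members are ';'-delimited.
    fields = line.replace(";", " ").split()
    if len(fields) < 2 or fields[0] != "zone:":
        raise ValueError("not a zone line: %r" % line)
    name = fields[1]
    members = []
    for token in fields[2:]:
        switch_id, port_id = token.split(",")
        members.append((int(switch_id), int(port_id)))
    return name, members


name, members = parse_port_zone("zone: zone_test 2,2 2,3 2,14 2,15")
print(name, members)
# zone_test [(2, 2), (2, 3), (2, 14), (2, 15)]
```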

Borzenkov, Andrei

3:15 p.m.

Actually, on Brocade today both pure port zoning and pure WWN zoning are hard zoning. Only mixed mode (some port and some WWN members) is soft.

Anyway, I still believe the problem is due to initiator ports seeing other targets and trying to take ownership of them, which in the case of NetApp means setting a SCSI reservation on them.
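The hard/soft distinction described above can be written down as a small check. This Python sketch simply encodes the rule as stated in this message (pure port or pure WWN means hard, mixed means soft); it has not been checked against Brocade documentation, and the function name and member formats are assumptions for illustration.

```python
import re

PORT_MEMBER = re.compile(r"^\d+,\d+$")  # switch,port member such as "2,14"
WWN_MEMBER = re.compile(r"^(?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2}$")  # colon-separated WWN


def zone_enforcement(members):
    """Classify a zone as 'hard' or 'soft' using the rule stated in the post."""
    kinds = set()
    for member in members:
        if PORT_MEMBER.match(member):
            kinds.add("port")
        elif WWN_MEMBER.match(member):
            kinds.add("wwn")
        else:
            raise ValueError("unrecognized zone member: %r" % member)
    if not kinds:
        raise ValueError("zone has no members")
    # One kind of member only: hard zoning. A mix of kinds: soft zoning.
    return "hard" if len(kinds) == 1 else "soft"


print(zone_enforcement(["2,2", "2,3", "2,14", "2,15"]))
print(zone_enforcement(["2,2", "50:01:43:80:21:e0:71:ec"]))
```

By this rule the port-only zone_test example from earlier in the thread would count as hard zoning.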

Momonth

3:23 p.m.

I have a ticket open with NetApp support and will update the thread once there are any results.

Also, I came across this KB: https://kb.netapp.com/support/index?page=content&id=3012956&pmv=prin... I'm now reading it and trying to understand how relevant it is in my case.

Momonth

4:12 p.m.

I think I have narrowed the problem down to "NetApp Release 8.0.2P4 7-Mode", i.e. only initiators attached to filers running this version of ONTAP were affected.

Momonth

11 Feb

1:03 p.m.

https://library.netapp.com/ecm/ecm_download_file/ECMM1280368

See "Running diagnostics tests (controller replacement)"; it states that "loopback plugs" should be used for testing. This means no "production" cables should be connected to the replaced motherboard while the diagnostic tests are running.

I suspect the field engineer who worked on it plugged the cables back in once the motherboard had been replaced and then started the diagnostics, which in turn caused the issues.

I'll be looking at repeating it somewhere in a test environment when possible.

Vladimir

Basil

2 Feb

12:09 p.m.

I'd like to see what your nodes think the WWNs of the servers are. If you do an igroup show -v on each one, do you see the same WWNs for each host?

Momonth

2:25 p.m.

Below is an example of a server that had issues.

Note that this server / initiator is only configured on filer-prod-201 and not on the filers that had maintenance going on:

$ ssh root@filer-prod-201 igroup show -v mc202bpmdb-01-ports
    mc202bpmdb-01-ports (FCP):
        OS Type: linux
        Member: 50:01:43:80:18:6a:f7:6c (logged in on: vtic, 0c)
        Member: 50:01:43:80:18:6a:f7:6e (not logged in)
        Member: 50:01:43:80:18:6b:00:2c (logged in on: vtic, 0d)
        Member: 50:01:43:80:18:6b:00:2e (not logged in)
        UUID: 437bce6a-f8c7-11e1-8651-00a0981ad474
        ALUA: Yes

$ ssh root@filer-prod-204 igroup show -v | egrep "50:01:43:80:18:6a:f7:6c|50:01:43:80:18:6b:00:2c"

$ ssh root@filer-prod-203 igroup show -v | egrep "50:01:43:80:18:6a:f7:6c|50:01:43:80:18:6b:00:2c"
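The same check can be scripted across many filers. A minimal Python sketch that extracts the member WWPNs and their login state from captured igroup show -v text; it assumes the "Member: ... (logged in on: ...)" layout shown in the output above, and the function name is made up for illustration.

```python
import re

# Assumption: member lines look like the igroup show -v output quoted above.
MEMBER_RE = re.compile(
    r"Member: (?P<wwpn>(?:[0-9a-f]{2}:){7}[0-9a-f]{2}) "
    r"\((?:not logged in|logged in on: (?P<ports>[^)]*))\)",
    re.IGNORECASE,
)


def igroup_members(output):
    """Map each member WWPN to the list of ports it is logged in on."""
    members = {}
    for m in MEMBER_RE.finditer(output):
        ports = m.group("ports")
        members[m.group("wwpn").lower()] = (
            [p.strip() for p in ports.split(",")] if ports else []
        )
    return members


sample = (
    "mc202bpmdb-01-ports (FCP): OS Type: linux "
    "Member: 50:01:43:80:18:6a:f7:6c (logged in on: vtic, 0c) "
    "Member: 50:01:43:80:18:6a:f7:6e (not logged in) "
    "Member: 50:01:43:80:18:6b:00:2c (logged in on: vtic, 0d) "
    "Member: 50:01:43:80:18:6b:00:2e (not logged in) "
    "UUID: 437bce6a-f8c7-11e1-8651-00a0981ad474 ALUA: Yes"
)
members = igroup_members(sample)
for wwpn, ports in members.items():
    print(wwpn, ", ".join(ports) if ports else "not logged in")
```

An empty result, as with the two egrep commands above that print nothing for filer-prod-204 and filer-prod-203, means the WWPN is not a member of any igroup on that filer.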
