Hi all,
We are having some major issues with our interconnects (or maybe the ISLs, at this point I honestly don't know). Some background first:
We have a 3160 MetroCluster running ONTAP 8.0.2 that has been stable for the last 3 years. We are now in the middle of upgrading to ONTAP 8.2, and for that we need to bring the FOS up to a higher version (we came from 6.3 and want/need to go to 7). After we upgraded the first fabric to 7.0.0b everything seemed to work fine, but later that day we saw a lot of errors on the NetApps and on the switches. At that point we changed some settings that had helped us with these errors in the past (port-based routing instead of exchange-based, even though exchange-based is what the guides list as best practice) and started upgrading the second fabric to the intermediate FOS (6.4.2). The errors immediately came back, so we stopped before going to 7.0.0.
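For completeness, the routing change itself was basically the following (a rough sketch from memory, exact syntax and prompts may differ per FOS version):
switch:admin> switchdisable
switch:admin> aptpolicy 1      (1 = port-based routing, 3 = exchange-based, the default)
switch:admin> switchenable
switch:admin> aptpolicy        (to verify the active policy)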
So, long story short, we now have two fabrics, one running FOS 7.0.0b and one running 6.4.2b, which is obviously not recommended, but we can't do much about it at the moment since things are extremely unstable connection-wise.
The switches themselves were set up according to the best-practice PDF that was available at the time, but even after applying the settings that changed in the meantime (basically only portcfgfillword and, I think, the IOD and DLS options) we don't see any improvement. Some snippets of the log files are included below; the entries in them keep popping up continuously.
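For reference, this is roughly how we check/apply those settings on the switch side (again a sketch from memory; the port number and fill-word mode are just examples, not necessarily what is configured right now):
switch:admin> portcfgshow 0          (current fill word mode of the port)
switch:admin> portcfgfillword 0, 3   (example only; one of the modes 0-3 from the guide)
switch:admin> iodshow                (in-order delivery state)
switch:admin> dlsshow                (dynamic load sharing state)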
Each fabric has a single ISL (so yes, disk traffic and CI traffic go over the same ISL), and we still have the old DS14 ESH4 shelves.
The weird thing is that this really only started after the FOS upgrade. We monitor all our devices quite heavily and we have never seen these kinds of errors before, neither the NetApp errors nor the switch errors (Brocade, by the way, hence the FOS :)). At this point we're pretty much clueless. NetApp support isn't telling us much either; so far we haven't gotten much further than sending lots of log files and reseating SFPs, which doesn't seem to help, and we're still waiting (and calling) for the log analysis.
Sorry in advance for the messy structure of this mail, we're putting in some long days at the moment ;-)
If any more info would be helpful I can send it ;-) (I just don't want to flood the list right now).
Thanks!
Karsten
Porterrshow on one of the switches (all switches give the same kind of results; port 0 = FCVI, port 4 = ISL):
nodes01witch10:root> porterrshow
        frames        enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy  c3timeout
         tx     rx     in    err  g_eof   shrt   long    eof    out     c3   fail   sync    sig                   tx     rx
===========================================================================================================================
  0:  16.5m  29.8m      0      0      0      0      0      0     90      0      4      4      4      0      0      0      0
  1:  44.9m  22.9m      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  2:  47.9m  22.3m      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  3:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  4:  50.8m  59.2m     26     24     22      0      0      2     26      0      0      0      0      0      0      0      0
  5:  14.2m  25.2m      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  6:  14.7m  30.3m      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  7:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  8:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
  9:   4.5m   3.4m      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 10: 861.6k   1.1m      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 11:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 12:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 13:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 14:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 15:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 16:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 17:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 18:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 19:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 20:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 21:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 22:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
 23:      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0
nodes01witch10:root>
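(The counters above are cumulative since the last stats reset; if it helps I can clear them and post a fresh delta after an hour or so, along the lines of:
switch:admin> statsclear
switch:admin> porterrshow            (re-run later to see which counters are still climbing)
)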
node01*> Fri Feb 13 01:37:28 CET [node01: cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of node01 by node02 disabled (unsynchronized log).
Fri Feb 13 01:38:53 CET [node01: scsi.cmd.transportErrorEMSOnly:debug]: Disk device eetrsansw10:10.28: Transport error during execution of command: HA status 0x9: cdb 0x28:12a6c7b0:0060.
Fri Feb 13 01:39:33 CET [node01: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Fri Feb 13 01:39:53 CET [node01: cf.rv.notConnected:error]: Connection for 'cfo_rv' failed.
Fri Feb 13 01:39:53 CET [node01: cf.rv.notConnected:error]: Connection for 'cfo_rv2' failed.
node01*> cf status
node02 is up, takeover disabled because of reason (interconnect error)
node01 has disabled takeover by node02 (interconnect error)
VIA Interconnect is down (link 0 up, link 1 up).
node01*> Fri Feb 13 01:40:03 CET [node01: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 1 is DOWN
Thu Feb 12 19:00:14 CET [node01: scsi.cmd.notReadyCondition:notice]: Disk device eetrsansw10:9.32: Device returns not yet ready: CDB 0x2a:325a93d8:0200: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(1705).
Thu Feb 12 19:00:14 CET [node01: scsi.cmd.notReadyCondition:notice]: Disk device eetrsansw10:9.32: Device returns not yet ready: CDB 0x2a:325a95d8:0200: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(1689).
Thu Feb 12 19:00:14 CET [node01: scsi.cmd.notReadyCondition:notice]: Disk device eetrsansw10:9.32: Device returns not yet ready: CDB 0x28:325a9c90:0100: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(1698).
Thu Feb 12 19:00:14 CET [node01: scsi.cmd.notReadyCondition:notice]: Disk device eetrsansw10:9.32: Device returns not yet ready: CDB 0x28:325a9c58:0008: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(1699).
Thu Feb 12 19:00:14 CET [node01: scsi.cmd.notReadyCondition:notice]: Disk device eetrsansw10:9.32: Device returns not yet ready: CDB 0x2a:325a97d8:0200: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(1686).
Thu Feb 12 19:00:14 CET [node01: scsi.cmd.notReadyCondition:notice]: Disk device eetrsansw10:9.32: Device returns not yet ready: CDB 0x2a:325a99d8:0200: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(1690).
Thu Feb 12 19:00:14 CET [node01: scsi.cmd.notReadyCondition:notice]: Disk device eetrsansw10:9.32: Device returns not yet ready: CDB 0x2f:2f27b400:0400: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(474).
Thu Feb 12 19:00:14 CET [node01: scsi.cmd.notReadyCondition:notice]: Disk device eetrsansw10:9.32: Device returns not yet ready: CDB 0x2a:325a91d8:0200: Sense Data SCSI:not ready - Drive spinning up (0x2 - 0x4 0x1 0x2)(1768).
Thu Feb 12 19:02:59 CET [node01: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Thu Feb 12 19:02:59 CET [node01: cf.nm.nicTransitionUp:info]: Interconnect link 0 is UP
Thu Feb 12 19:03:19 CET [node01: cf.rv.notConnected:error]: Connection for 'cfo_rv' failed.
Thu Feb 12 19:03:19 CET [node01: cf.rv.notConnected:error]: Connection for 'cfo_rv2' failed.
Thu Feb 12 19:03:28 CET [node01: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 1 is DOWN
Thu Feb 12 19:04:37 CET [node01: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 1 due to rendezvous connection timeout
Thu Feb 12 19:06:10 CET [node01: cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of node01 by node02 disabled (unsynchronized log).
Thu Feb 12 19:08:16 CET [node01: cf.ic.qlgc.viErr:error]: Qlogic VI FC Adapter: ISP_CS_VI_ERROR vinum = 0xa state = 0x3 code = 0x6
Thu Feb 12 19:13:09 CET [node01: cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of node01 by node02 disabled (interconnect error).
Thu Feb 12 19:18:14 CET [node01: cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of node01 by node02 disabled (unsynchronized log).
Thu Feb 12 19:20:47 CET [node01: cf.fsm.takeoverOfPartnerDisabled:notice]: Failover monitor: takeover of node02 disabled (interconnect error).
Thu Feb 12 19:21:35 CET [node01: cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of node01 by node02 disabled (unsynchronized log).
Thu Feb 12 19:22:24 CET [node01: cf.fsm.takeoverOfPartnerDisabled:notice]: Failover monitor: takeover of node02 disabled (interconnect error).
Thu Feb 12 19:29:13 CET [node01: cf.ic.qlgc.viErr:error]: Qlogic VI FC Adapter: ISP_CS_VI_ERROR vinum = 0x7 state = 0x3 code = 0x2
Thu Feb 12 19:29:33 CET [node01: cf.rv.notConnected:error]: Connection for 'cfo_rv' failed.
Thu Feb 12 19:29:33 CET [node01: cf.rv.notConnected:error]: Connection for 'cfo_rv2' failed.
Thu Feb 12 19:30:04 CET [node01: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Thu Feb 12 19:30:05 CET [node01: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 1 is DOWN
Thu Feb 12 19:34:44 CET [node01: cf.fsm.takeoverByPartnerEnabled:notice]: Failover monitor: takeover of node01 by node02 enabled
Thu Feb 12 19:36:54 CET [node01: cf.ic.qlgc.viErr:error]: Qlogic VI FC Adapter: ISP_CS_VI_ERROR vinum = 0xa state = 0x3 code = 0x6
Thu Feb 12 19:45:51 CET [node01: raid.mirror.aggrSnapUse:warning]: Aggregate Snapshot copies are used in SyncMirror aggregate 'aggr0'. That is not recommended.
Thu Feb 12 19:51:57 CET [node01: cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of node01 by node02 disabled (unsynchronized log).
Thu Feb 12 19:58:42 CET [node01: cf.fsm.takeoverOfPartnerEnabled:notice]: Failover monitor: takeover of node02 enabled
Thu Feb 12 20:00:16 CET [node01: cf.takeover.disabled:warning]: Controller Failover is licensed but takeover of partner is disabled due to reason : unsynchronized log.
Thu Feb 12 20:00:43 CET [node01: cf.fsm.takeoverByPartnerEnabled:notice]: Failover monitor: takeover of node01 by node02 enabled
Thu Feb 12 20:04:10 CET [node01: cf.fsm.takeoverOfPartnerEnabled:notice]: Failover monitor: takeover of node02 enabled
Thu Feb 12 20:05:29 CET [node01: cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of node01 by node02 disabled (interconnect error).
Thu Feb 12 20:14:59 CET [node01: cf.fsm.takeoverByPartnerDisabled:notice]: Failover monitor: takeover of node01 by node02 disabled (interconnect error).
Thu Feb 12 20:15:44 CET [node01: cf.ic.qlgc.viErr:error]: Qlogic VI FC Adapter: ISP_CS_VI_ERROR vinum = 0xa state = 0x3 code = 0x6
Thu Feb 12 20:16:04 CET [node01: cf.rv.notConnected:error]: Connection for 'cfo_rv' failed.
Thu Feb 12 20:16:04 CET [node01: cf.rv.notConnected:error]: Connection for 'cfo_rv2' failed.
Thu Feb 12 20:16:05 CET [node01: cf.nm.nicTransitionUp:info]: Interconnect link 1 is UP
Thu Feb 12 20:16:06 CET [node01: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Thu Feb 12 20:16:06 CET [node01: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 1 is DOWN
Fri Feb 13 01:00:17 CET [node02: cf.takeover.disabled:warning]: Controller Failover is licensed but takeover of partner is disabled due to reason : unsynchronized log.
Fri Feb 13 01:02:35 CET [node02: cf.fsm.takeoverByPartnerEnabled:notice]: Failover monitor: takeover of node02 by node01 enabled
Fri Feb 13 01:03:45 CET [node02: cf.ic.qlgc.viErr:error]: Qlogic VI FC Adapter: ISP_CS_VI_ERROR vinum = 0x8 state = 0x3 code = 0x6
Fri Feb 13 01:03:45 CET [node02: cf.nm.nicReset:warning]: Initiating soft reset on Cluster Interconnect card 0 due to ispfcvi2400 fatal VI error
Fri Feb 13 01:12:36 CET [node02: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Fri Feb 13 01:12:42 CET [node02: cf.fsm.takeoverByPartnerEnabled:notice]: Failover monitor: takeover of node02 by node01 enabled
Fri Feb 13 01:18:52 CET [node02: cf.fsm.takeoverOfPartnerEnabled:notice]: Failover monitor: takeover of node01 enabled
Fri Feb 13 01:20:00 CET [node02: monitor.globalStatus.critical:CRITICAL]: Controller failover of node01 is not possible: unsynchronized log. /vol/db_p_mcs7_iscsi is full (using or reserving 98% of space and 0% of inodes, using 98% of reserve).
Fri Feb 13 01:39:33 CET [node02: cf.ic.qlgc.viErr:error]: Qlogic VI FC Adapter: ISP_CS_VI_ERROR vinum = 0xa state = 0x3 code = 0x6
Fri Feb 13 01:39:54 CET [node02: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 0 is DOWN
Fri Feb 13 01:39:54 CET [node02: cf.nm.nicTransitionDown:warning]: Cluster Interconnect link 1 is DOWN
Fri Feb 13 01:39:54 CET [node02: cf.rv.notConnected:error]: Connection for 'cfo_rv2' failed.
Fri Feb 13 01:52:48 CET [node02: ems.engine.inputSuppress:warning]: Event 'openssh.invalid.channel.req' suppressed 87 times since Fri Feb 13 00:00:04 CET 2015.
Fri Feb 13 01:52:48 CET [node02: openssh.invalid.channel.req:warning]: SSH client (SSH-2.0-OpenSSH_5.3) from 10.132.0.72 sent unsupported channel request (10, env).