Hi Jay,
We also had an 'isp2100_timeout]: Resetting ISP2100 in slot ..' a few weeks ago (F740, DOT 5.2.3, FC-AL 18 GB, Gbit card).
!! Be careful !!
Replacing the motherboard and the FC adapter card finally solved the problem, after we had already replaced all cables, all shelves on this loop, and about 3 disks.
Best regards,
Markus Bentele MTU Friedrichshafen GmbH * 07541-90-2654 * bentele@mtu-friedrichshafen.com
From: Jay Soffian [SMTP:jay@cimedia.com]
Sent: Wednesday, November 10, 1999 08:07
To: toasters@mathworks.com
Subject: 'Loop break detected' followed by failover
I've opened a case with NOW on this.
One of my filers (F740/512MB/28x9GB disks/DOT 5.2.3) tonight just did this:
Tue Nov 9 19:00:01 EST [subzero: statd]: 7:00pm up 44 days, 13:45 2570782327 NFS ops, 0 CIFS ops, 0 HTTP ops
Tue Nov 9 19:16:46 EST [subzero: isp2100_main]: Loop break detected on ISP2100 in slot 1.
Tue Nov 9 19:16:50 EST [subzero: isp2100_timeout]: Resetting ISP2100 in slot 1
Tue Nov 9 19:19:14 EST [subzero/viking: cf_takeover]: relog syslog
Tue Nov 9 19:17:15 EST [subzero: isp2100_timeout]: Resetting ISP2100 in slot 1
Tue Nov 9 19:19:14 EST [subzero/viking: cf_takeover]: relog syslog
Tue Nov 9 19:17:27 EST [subzero: isp2100_timeout]: Resetting ISP2100 in slot 1
Tue Nov 9 19:19:15 EST [subzero/viking: asup_main]: Cluster Notification mail sent
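For anyone who wants to watch their own messages file for the same pattern, here is a rough Python sketch that counts ISP2100 events per slot and notes whether a takeover was logged. The regular expressions are only an assumption about the message layout, inferred from the lines quoted above; `summarize`, `LINE_RE` and `SLOT_RE` are hypothetical names, not part of any NetApp tool.

```python
import re
from collections import Counter

# Sample lines copied from the log quoted in this message.
LOG = """\
Tue Nov 9 19:00:01 EST [subzero: statd]: 7:00pm up 44 days, 13:45 2570782327 NFS ops, 0 CIFS ops, 0 HTTP ops
Tue Nov 9 19:16:46 EST [subzero: isp2100_main]: Loop break detected on ISP2100 in slot 1.
Tue Nov 9 19:16:50 EST [subzero: isp2100_timeout]: Resetting ISP2100 in slot 1
Tue Nov 9 19:19:14 EST [subzero/viking: cf_takeover]: relog syslog
Tue Nov 9 19:17:15 EST [subzero: isp2100_timeout]: Resetting ISP2100 in slot 1
Tue Nov 9 19:19:14 EST [subzero/viking: cf_takeover]: relog syslog
Tue Nov 9 19:17:27 EST [subzero: isp2100_timeout]: Resetting ISP2100 in slot 1
Tue Nov 9 19:19:15 EST [subzero/viking: asup_main]: Cluster Notification mail sent
"""

# Assumed layout: "<timestamp> [<host>: <process>]: <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \w+) "
    r"\[(?P<host>[^:\]]+): (?P<proc>[^\]]+)\]: (?P<msg>.*)$"
)
SLOT_RE = re.compile(r"ISP2100 in slot (\d+)")


def summarize(log_text):
    """Count ISP2100 events per (process, slot) and note any takeover."""
    events = Counter()
    takeover = False
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not fit the assumed layout
        if m["proc"] == "cf_takeover":
            takeover = True
        slot = SLOT_RE.search(m["msg"])
        if slot:
            events[(m["proc"], int(slot.group(1)))] += 1
    return events, takeover


if __name__ == "__main__":
    events, takeover = summarize(LOG)
    for (proc, slot), count in sorted(events.items()):
        print(f"{proc} slot {slot}: {count}")
    print("takeover logged:", takeover)
```

Repeated `isp2100_timeout` resets on one slot followed by a `cf_takeover` entry is the sequence seen here; the script only reports counts and draws no conclusion about the cause.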
No core dump, I found the filer at an 'ok' prompt.
I rebooted the filer, did a cf giveback a few minutes later and all appears fine now.
Anything recommended besides checking that all cables are tight and all drives are fully seated?
I couldn't find any bugs relating to this on NOW (except a bug in the Diagnostics not resetting the FC-AL loop properly when a loop break occurs). I've heard rumors that some F740's were shipped with bad on-board FC-AL controllers, but this F740 has been in operation for over a year w/o any problems (it's only been clustered since June though).
Suggestions?
j.
On Wed, 10 Nov 1999 bentele@mtu-friedrichshafen.com wrote:
> We also had an 'isp2100_timeout]: Resetting ISP2100 in slot ..' a few weeks ago (F740, DOT 5.2.3, FC-AL 18 GB, Gbit card).
> !! Be careful !!
> Replacing the motherboard and the FC adapter card finally solved the problem, after we had already replaced all cables, all shelves on this loop, and about 3 disks.
I think there is a known (but perhaps not widely acknowledged) problem with the on-board FC-AL interface on the F740 motherboards (and possibly on other models as well). I have four F740's in production, of which two have had histories of flaky FC-AL (ISP2100 timeouts during disk scrubs, hung RAID reconstructions, simultaneous errors across all drives on a shelf, etc.). Netapp has sent us four slot-based FC-AL adapters and told me to use those instead of the on-board ones.
Another shipment of four F740's arrived earlier this week, and I noticed they came with two slot-based FC-AL adapters each (these are clustered systems). I take that as an indication that Netapp still believes there is some sort of defect on the F740 motherboards.
Brian Tao wrote:
> Another shipment of four F740's arrived earlier this week, and I noticed they came with two slot-based FC-AL adapters each (these are clustered systems). I take that as an indication that Netapp still believes there is some sort of defect on the F740 motherboards.
We've run into some trouble with the onboard FC-AL on F700's. It's interesting that they are now shipping with PCI cards installed.
Will Netapp be sending people free PCI FC-AL cards for existing F700 systems?
Graham
"Graham" == Graham C Knight grahamk@ast.lmco.com writes:
Graham> We've run into some trouble with the onboard FC-AL on
Graham> F700's. It's interesting that they are now shipping with
Graham> PCI cards installed.

Graham> Will Netapp be sending people free PCI FC-AL cards for
Graham> existing F700 systems?
NA just sent me a pair of PCI FC-AL's to replace the on-board FC-AL's. Apparently, the on-board FC-AL in filer A detected a loop break on its partner's disks, and that caused filer A to fail over to filer B. It seems odd that filer A would be the one to fail over upon detecting a loop break on its secondary FC-AL loop.
I've got a question about upgrading the FC-AL controllers though.
Two options for upgrading:
1) turn everything off, install the FC-AL controllers, move each filer's A-loop, turn everything back on.
2) failover filer A, turn filer A off, upgrade its FC-AL controller, move its A loop, turn it back on, giveback to filer A from filer B. Then do the same on filer B.
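To make the ordering in option 2 explicit, here is a small Python sketch that just prints the steps for both filers in sequence. It is a checklist generator only: `rolling_upgrade_plan` is a hypothetical helper, the step wording is paraphrased from the list above, and nothing here talks to a filer or issues real commands.

```python
def rolling_upgrade_plan(pair=("filer A", "filer B")):
    """Return the ordered steps of option 2 for a two-node cluster."""
    steps = []
    # First upgrade the first node while its partner serves, then swap roles.
    for node, partner in (pair, pair[::-1]):
        steps += [
            f"fail over {node} so that {partner} serves its disks",
            f"turn {node} off",
            f"upgrade the FC-AL controller in {node}",
            f"move the A loop of {node} to the new controller",
            f"turn {node} back on",
            f"give back to {node} from {partner}",
        ]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(rolling_upgrade_plan(), start=1):
        print(f"{number:2d}. {step}")
```

The point of the ordering is that only one filer is down at any time, so the partner keeps serving its disks throughout.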
This should be no different than adding a shelf to a cluster. Last time we did that, we used scenario (2). It didn't go smoothly, but that's because we were bitten by the 44+ days uptime bug.
j.
On Fri, 12 Nov 1999, Jay Soffian wrote:
> 2) failover filer A, turn filer A off, upgrade its FC-AL controller, move its A loop, turn it back on, giveback to filer A from filer B. Then do the same on filer B.
This is what we did on a pair of filers with Oracle data. For added safety, we temporarily flipped Oracle to archive mode, logging to local disk, but it wasn't needed as the upgrade went exactly as planned.
NetApp support informed me that they now ship all F760's with the onboard FC-AL adapter terminated and an extra PCI FC-AL adapter added instead. I was also told that the onboard controller most likely caused several of the problems we experienced, such as 'watchdog resets', 'UNCORR PROC 98' errors, the Netapp momentarily disappearing from the network, and 'disk underrun' errors. Terminating the onboard controller and adding a new PCI FC-AL controller seems to have fixed all of these problems for us.
David Midgett Manager, Server Operations About.com, Inc. The Leading Network of Niche Vertical Sites. http://About.com
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com] On Behalf Of Brian Tao
Sent: Thursday, November 11, 1999 7:22 PM
To: bentele@mtu-friedrichshafen.com
Cc: toasters@mathworks.com; jay@cimedia.com
Subject: Re: AW: 'Loop break detected' followed by failover
On Wed, 10 Nov 1999 bentele@mtu-friedrichshafen.com wrote:
> We also had an 'isp2100_timeout]: Resetting ISP2100 in slot ..' a few weeks ago (F740, DOT 5.2.3, FC-AL 18 GB, Gbit card).
> !! Be careful !!
> Replacing the motherboard and the FC adapter card finally solved the problem, after we had already replaced all cables, all shelves on this loop, and about 3 disks.
I think there is a known (but perhaps not widely acknowledged) problem with the on-board FC-AL interface on the F740 motherboards (and possibly on other models as well). I have four F740's in production, of which two have had histories of flaky FC-AL (ISP2100 timeouts during disk scrubs, hung RAID reconstructions, simultaneous errors across all drives on a shelf, etc.). Netapp has sent us four slot-based FC-AL adapters and told me to use those instead of the on-board ones.
Another shipment of four F740's arrived earlier this week, and I noticed they came with two slot-based FC-AL adapters each (these are clustered systems). I take that as an indication that Netapp still believes there is some sort of defect on the F740 motherboards.
--
Brian Tao (BT300, taob@risc.org)
"Though this be madness, yet there is method in't"