Re: Disk Controller Failures - toasters

30 Jun 1999


On Mon, 28 Jun 1999, Shaun T. Erickson wrote:
...
My experience with NetApp has not been very good, yet I know that
many others have few if any troubles with them. While it didn't
leave me with a particularly good first impression, I'm hoping that
this F740 will meet our needs and improve our experience with
NetApp.

We have a couple of clustered F740s in pre-production testing,
and immediately after reading your original message about FC-AL
controller failure, I found this on one of them:
Sun Jun 20 01:00:02 EDT [nfs3: raid_scrub_admin]: Beginning disk scrubbing...
Sun Jun 20 01:18:50 EDT [nfs3: isp2100_timeout]: 0a.2 (0xfffffc0000abf258,0x28:00155d88:0040,0/0,1189063/0/0,9156358/0): command timeout, aborting request
Sun Jun 20 01:18:55 EDT [nfs3: isp2100_timeout]: 0a.4 (0xfffffc0000b8a118,0x28:00155dc8:0040,0/0,1186670/0/0,9156358/0): command timeout, aborting request
Sun Jun 20 01:19:00 EDT [nfs3: isp2100_timeout]: 0a.5 (0xfffffc0000aea4f8,0x2a:0000a3f8:0008,0/0,2129956/0/0,9156358/0): command timeout, aborting request
Sun Jun 20 01:19:06 EDT [nfs3: isp2100_timeout]: 0a.6 (0xfffffc0000adf3d8,0x28:0000a3f0:0008,0/0,2131151/0/0,9156358/0): command timeout, aborting request
Sun Jun 20 01:19:14 EDT [nfs3: isp2100_timeout]: Resetting ISP2100 in slot 0a
Sun Jun 20 05:34:04 GMT [nfs3: cf_main]: Cluster monitor: takeover of partner enabled
Sun Jun 20 05:34:08 GMT [nfs3: rc]: de_main: e0 : Link up.
Sun Jun 20 01:34:08 EDT [nfs3: rc]: saving 165M to /etc/crash/core.0.nz ("WAFL hung.")
Sun Jun 20 01:34:34 EDT [nfs3: rc]: relog syslog Sun Jun 20 01:19:14 EDT [nfs3: isp2100_timeout]: Resetting ISP2100 in slot 0a
Sun Jun 20 01:34:34 EDT [nfs3: rc]: NetApp Release 5.2.2 boot complete.  Last disk update written at Sun Jun 20 01:00:07 EDT 1999
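For anyone who wants to check their own filers for the same pattern (a run of command timeouts on several disks, followed by a reset of the adapter), here is a rough Python sketch. The regular expressions are inferred only from the excerpts in this message, not from any NetApp documentation, and the function name summarize is made up for illustration, so treat it as a starting point rather than a reliable detector.

```python
import re
from collections import defaultdict

# Both patterns are assumptions based on the log excerpts in this message.
TIMEOUT_RE = re.compile(
    r"\[(?P<host>\w+): isp2100_timeout\]: (?P<disk>\w+\.\d+) \(.*command timeout"
)
RESET_RE = re.compile(
    r"\[(?P<host>\w+): isp2100_timeout\]: Resetting ISP2100 in slot (?P<slot>\w+)"
)


def summarize(lines):
    """Return a list of (host, slot, disks) tuples, one per adapter reset.

    disks lists the disks that logged a command timeout on that host and
    adapter since the previous reset (or since the start of the input).
    """
    pending = defaultdict(list)  # (host, adapter) -> disks that timed out
    resets = []
    for line in lines:
        # Entries replayed after a reboot or takeover are prefixed with
        # "relog syslog"; skip them so nothing is counted twice.
        if "relog syslog" in line:
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            # Disk names look like "0a.2": adapter "0a", device 2.
            adapter = m.group("disk").split(".")[0]
            pending[(m.group("host"), adapter)].append(m.group("disk"))
            continue
        m = RESET_RE.search(line)
        if m:
            key = (m.group("host"), m.group("slot"))
            resets.append((key[0], key[1], pending.pop(key, [])))
    return resets
```

Passing it the lines of a saved copy of the messages log (for example summarize(open(path)), where path is wherever you keep that copy) gives one tuple per reset, so a reset preceded by timeouts on several different disks of the same adapter stands out quickly.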
I then tried forcing a disk failure to see if my disk
reconstruction would produce the same results yours did.  It did,
except mine never went beyond 4%.  It didn't crash either... it just
sort of stayed "stuck".  I was able to force a cluster takeover from
the other filer, and it continued the reconstruction, but something
else messed up, which caused the takeover filer not to assume the IP
address of its twin:
Tue Jun 29 22:16:49 EDT [nfs3: raid_disk_admin]: Unload of disk 0a.5 (S/N LK95551800002941HH3U) has completed successfully.
Tue Jun 29 22:17:10 EDT [nfs3: rshd_0]: Option raid.reconstruct_speed changed on one cluster node 
Tue Jun 29 22:21:46 EDT [nfs3: isp2100_timeout]: 0a.3 (0xfffffc0000b514b8,0x28:00155cc0:0040,0/0,52562/0/0,521771/0): command timeout, aborting request
Tue Jun 29 22:21:51 EDT [nfs3: isp2100_timeout]: 0a.4 (0xfffffc0000b49ad8,0x28:00155c40:0040,1/0,28029/0/0,521771/0): command timeout, aborting request
Tue Jun 29 22:21:56 EDT [nfs3: isp2100_timeout]: 0a.6 (0xfffffc0000b50618,0x28:00155cc0:0040,0/0,193448/0/0,521771/0): command timeout, aborting request
Tue Jun 29 22:22:04 EDT [nfs3: isp2100_timeout]: Resetting ISP2100 in slot 0a
And then on the partner filer:
Tue Jun 29 22:27:49 EDT [nfs4: rc]: Cluster monitor: takeover initiated by operator
Tue Jun 29 22:27:49 EDT [nfs4: cf_main]: Cluster monitor: UP --> TAKEOVER
Tue Jun 29 22:27:49 EDT [nfs4: cf_takeover]: Cluster monitor: takeover started
Tue Jun 29 22:27:50 EDT [nfs4: de0]: de_main: e3 : Link down. Check cable.
Tue Jun 29 22:28:02 EDT [nfs4: disk_admin]: Resetting all devices on ISP2100 in slot 1
Tue Jun 29 22:28:09 EDT [nfs4: raid_disk_admin]: Label write on 1.5 (S/N LK95551800002941HH3U) failed.
Tue Jun 29 22:28:09 EDT [nfs4: raid_disk_admin]: One disk is missing from volume partner:vol0, RAID group 0.
        A "hot spare" disk is available and the missing disk
        will be reconstructed on the spare disk.
Tue Jun 29 22:28:10 EDT [nfs3/nfs4: cf_takeover]: ifconfig: 'e0' cannot be configured: no partner.
Tue Jun 29 22:28:10 EDT [nfs3/nfs4: cf_takeover]: relog syslog Tue Jun 29 22:21:56 EDT [nfs3: isp2100_timeout]: 0a.6 (0xfffffc0000b50618,0x28:00155cc0:0040,0/0,193448/0/0,521771/0):
Tue Jun 29 22:28:10 EDT [nfs3/nfs4: cf_takeover]: relog syslog Tue Jun 29 22:22:04 EDT [nfs3: isp2100_timeout]: Resetting ISP2100 in slot 0a
Tue Jun 29 22:28:10 EDT [nfs4: cf_takeover]: WARNING: 1 error detected during network takeover processing
Tue Jun 29 22:28:10 EDT [nfs4: cf_takeover]: WARNING: Some network clients may not be able to access the cluster during takeover
Tue Jun 29 22:28:10 EDT [nfs4: cf_takeover]: Cluster monitor: takeover during ifconfig_2 failed; takeover continuing...
Tue Jun 29 22:28:10 EDT [nfs4: cf_takeover]: Cluster monitor: takeover completed
Anyway, this sounds and tastes like a FibreChannel controller
failure too.
-- 
Brian Tao (BT300, taob@risc.org)
"Though this be madness, yet there is method in't"