On Mon, 28 Jun 1999, Shaun T. Erickson wrote:
> My experience with NetApp has not been very good, yet I know that many others have few if any troubles with them. While it didn't leave me with a particularly good first impression, I'm hoping that this F740 will meet our needs and improve our experience with NetApp.
We have a couple of clustered F740s in pre-production testing, and immediately after reading your original message about the FC-AL controller failure, I found this in the logs on one of them:
Sun Jun 20 01:00:02 EDT [nfs3: raid_scrub_admin]: Beginning disk scrubbing...
Sun Jun 20 01:18:50 EDT [nfs3: isp2100_timeout]: 0a.2 (0xfffffc0000abf258,0x28:00155d88:0040,0/0,1189063/0/0,9156358/0): command timeout, aborting request
Sun Jun 20 01:18:55 EDT [nfs3: isp2100_timeout]: 0a.4 (0xfffffc0000b8a118,0x28:00155dc8:0040,0/0,1186670/0/0,9156358/0): command timeout, aborting request
Sun Jun 20 01:19:00 EDT [nfs3: isp2100_timeout]: 0a.5 (0xfffffc0000aea4f8,0x2a:0000a3f8:0008,0/0,2129956/0/0,9156358/0): command timeout, aborting request
Sun Jun 20 01:19:06 EDT [nfs3: isp2100_timeout]: 0a.6 (0xfffffc0000adf3d8,0x28:0000a3f0:0008,0/0,2131151/0/0,9156358/0): command timeout, aborting request
Sun Jun 20 01:19:14 EDT [nfs3: isp2100_timeout]: Resetting ISP2100 in slot 0a
Sun Jun 20 05:34:04 GMT [nfs3: cf_main]: Cluster monitor: takeover of partner enabled
Sun Jun 20 05:34:08 GMT [nfs3: rc]: de_main: e0 : Link up.
Sun Jun 20 01:34:08 EDT [nfs3: rc]: saving 165M to /etc/crash/core.0.nz ("WAFL hung.")
Sun Jun 20 01:34:34 EDT [nfs3: rc]: relog syslog Sun Jun 20 01:19:14 EDT [nfs3: isp2100_timeout]: Resetting ISP2100 in slot 0a
Sun Jun 20 01:34:34 EDT [nfs3: rc]: NetApp Release 5.2.2 boot complete. Last disk update written at Sun Jun 20 01:00:07 EDT 1999
I then tried forcing a disk failure to see whether my disk reconstruction would produce the same results yours did. It did, except mine never got beyond 4%. It didn't crash either; it just stayed "stuck". I was able to force a cluster takeover from the other filer, and it continued the reconstruction, but something else went wrong that kept the takeover filer from assuming the IP address of its twin:
Tue Jun 29 22:16:49 EDT [nfs3: raid_disk_admin]: Unload of disk 0a.5 (S/N LK95551800002941HH3U) has completed successfully.
Tue Jun 29 22:17:10 EDT [nfs3: rshd_0]: Option raid.reconstruct_speed changed on one cluster node
Tue Jun 29 22:21:46 EDT [nfs3: isp2100_timeout]: 0a.3 (0xfffffc0000b514b8,0x28:00155cc0:0040,0/0,52562/0/0,521771/0): command timeout, aborting request
Tue Jun 29 22:21:51 EDT [nfs3: isp2100_timeout]: 0a.4 (0xfffffc0000b49ad8,0x28:00155c40:0040,1/0,28029/0/0,521771/0): command timeout, aborting request
Tue Jun 29 22:21:56 EDT [nfs3: isp2100_timeout]: 0a.6 (0xfffffc0000b50618,0x28:00155cc0:0040,0/0,193448/0/0,521771/0): command timeout, aborting request
Tue Jun 29 22:22:04 EDT [nfs3: isp2100_timeout]: Resetting ISP2100 in slot 0a
And then on the partner filer:
Tue Jun 29 22:27:49 EDT [nfs4: rc]: Cluster monitor: takeover initiated by operator
Tue Jun 29 22:27:49 EDT [nfs4: cf_main]: Cluster monitor: UP --> TAKEOVER
Tue Jun 29 22:27:49 EDT [nfs4: cf_takeover]: Cluster monitor: takeover started
Tue Jun 29 22:27:50 EDT [nfs4: de0]: de_main: e3 : Link down. Check cable.
Tue Jun 29 22:28:02 EDT [nfs4: disk_admin]: Resetting all devices on ISP2100 in slot 1
Tue Jun 29 22:28:09 EDT [nfs4: raid_disk_admin]: Label write on 1.5 (S/N LK95551800002941HH3U) failed.
Tue Jun 29 22:28:09 EDT [nfs4: raid_disk_admin]: One disk is missing from volume partner:vol0, RAID group 0. A "hot spare" disk is available and the missing disk will be reconstructed on the spare disk.
Tue Jun 29 22:28:10 EDT [nfs3/nfs4: cf_takeover]: ifconfig: 'e0' cannot be configured: no partner.
Tue Jun 29 22:28:10 EDT [nfs3/nfs4: cf_takeover]: relog syslog Tue Jun 29 22:21:56 EDT [nfs3: isp2100_timeout]: 0a.6 (0xfffffc0000b50618,0x28:00155cc0:0040,0/0,193448/0/0,521771/0):
Tue Jun 29 22:28:10 EDT [nfs3/nfs4: cf_takeover]: relog syslog Tue Jun 29 22:22:04 EDT [nfs3: isp2100_timeout]: Resetting ISP2100 in slot 0a
Tue Jun 29 22:28:10 EDT [nfs4: cf_takeover]: WARNING: 1 error detected during network takeover processing
Tue Jun 29 22:28:10 EDT [nfs4: cf_takeover]: WARNING: Some network clients may not be able to access the cluster during takeover
Tue Jun 29 22:28:10 EDT [nfs4: cf_takeover]: Cluster monitor: takeover during ifconfig_2 failed; takeover continuing...
Tue Jun 29 22:28:10 EDT [nfs4: cf_takeover]: Cluster monitor: takeover completed
Anyway, this sounds and tastes like a FibreChannel controller failure too.
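A rough way to check whether the timeouts are spread over several disks behind one adapter, rather than confined to a single drive, is to tally the isp2100_timeout messages per device. The Python below is only a sketch: the regular expression is a guess at the line format based solely on the excerpts in this message, and the names `TIMEOUT_RE`, `tally_timeouts`, and `disks_per_adapter` are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed format, inferred only from the excerpts in this message:
#   [<host>: isp2100_timeout]: <adapter>.<disk> (<details>): command timeout, ...
TIMEOUT_RE = re.compile(
    r"\[(?P<host>[^:\]]+): isp2100_timeout\]: "
    r"(?P<adapter>[0-9a-z]+)\.(?P<disk>\d+) \(.*?\): command timeout"
)


def tally_timeouts(log_text):
    """Count command timeouts per (host, adapter, disk)."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        key = (match.group("host"), match.group("adapter"), match.group("disk"))
        counts[key] += 1
    return counts


def disks_per_adapter(counts):
    """Group the disks that timed out by (host, adapter)."""
    grouped = defaultdict(set)
    for (host, adapter, disk) in counts:
        grouped[(host, adapter)].add(disk)
    return dict(grouped)


# Three lines copied from the Jun 29 excerpt above.
SAMPLE = """\
Tue Jun 29 22:21:46 EDT [nfs3: isp2100_timeout]: 0a.3 (0xfffffc0000b514b8,0x28:00155cc0:0040,0/0,52562/0/0,521771/0): command timeout, aborting request
Tue Jun 29 22:21:51 EDT [nfs3: isp2100_timeout]: 0a.4 (0xfffffc0000b49ad8,0x28:00155c40:0040,1/0,28029/0/0,521771/0): command timeout, aborting request
Tue Jun 29 22:21:56 EDT [nfs3: isp2100_timeout]: 0a.6 (0xfffffc0000b50618,0x28:00155cc0:0040,0/0,193448/0/0,521771/0): command timeout, aborting request
"""

if __name__ == "__main__":
    for (host, adapter), disks in disks_per_adapter(tally_timeouts(SAMPLE)).items():
        print(host, adapter, sorted(disks, key=int))
```

Several different disks timing out within seconds of each other behind the same adapter is at least consistent with a controller or loop problem, though a tally like this cannot prove it.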