To add to insult to injury, my want to check hardware compatibility. I usually would downgrade my hba's drivers to a known good (and older) version. The host utility usually sets timeouts and detects if its a ms cluster or not.. Sets the disk timeouts.
Date: Tue, 24 Jan 2012 06:30:54 -0800 From: fredgrieco@yahoo.com Subject: Re: How long can Windows survive loss of storage? To: skendric@fhcrc.org; toasters@teaparty.net
Stuart,
I would make sure your system is running single image mode clustering. The command "fcp show cfmode" will show you the clustering type for fcp clients. If it's single image mode, this will return "single_image." I'm fairly certain that single_image mode was available in 7.2.4.
FCP clustering is completely active-active in single image mode. Each client should see two paths on one side and two from the other. Losing one side just means losing two paths. If you are seeing FCP as out completely for 15 seconds on all paths, something's not right.
Fred
From: Stuart Kendrick skendric@fhcrc.org To: "toasters@teaparty.net" toasters@teaparty.net Sent: Tuesday, January 24, 2012 8:55 AM Subject: How long can Windows survive loss of storage?
Hi folks,
We have MS SQL Server (a handful of instances, riding on top of MS Cluster Services) attached via Fibre Channel to a filer.
Whenever we perform a takeover, SQL Server crumps. The DBAs shut it down, restart, repair/restore their databases (unhappy times), life returns to normal. We've replicated this behavior a handful of times: every single time we experience a takeover, whether administratively or Murphy-induced.
I'd like to think that we can do better than this.
This particular filer services five clients: the two MS SQL Servers via Fibre Channel and three Exchange servers via iSCSI. The iSCSI clients notice the takeover, but their initiators do the multipath thing and ride through the event fine.
In the log extract below (heavily edited), we can see the iSCSI and FCP services going down at 21:47:25. And coming back on-line at 21:47:40 (FCP) and 21:47:49 (iSCSI). The iSCSI client 'hamlet' re-establishes its session at 21:47:51. The rest of the log extract records the downward spiral of the Windows server 'custard-1'.
To my way of thinking, iSCSI was unavailable for 24 seconds, FCP for 15 seconds.
(1) Are the iSCSI and FCP services really and truly available once the 'fcp.service.startup' and 'iscsi.service.startup' messages are logged, or might there be a delay before storage is actually available to clients? (2) How could we shrink takeover times? (3) Why does MS Windows over FC have such trouble surviving a 15 second disruption to storage? (4) How could we increase MS Windows timeouts? (5) How long can /your/ MS Windows FCP clients survive an interruption to storage?
Toaster: FAS3020 running 7.2.4 SQL Server Clients: Windows 2008 running SnapDrive & SnapManager for SQL Server Exchange Clients: Windows 2003 running SnapDrive & SnapManager for Exchange
Log extract below
--sk
Stuart Kendrick FHCRC
Feb 7 21:47:22 toaster-b-svif2 [toaster-b: cf.misc.operatorTakeover:warning]: Cluster monitor: takeover initiated by operator Feb 7 21:47:22 toaster-b-svif2 [toaster-b: cf.fsm.nfo.acceptTakeoverReq:warning]: Negotiated failover: accepting takeover request by partner, reason: operator initiated cf takeover. Asking partner to shutdown gracefully; will takeover in at most 180 seconds. Feb 7 21:47:25 toaster-a-svif2 [toaster-a: cf.fsm.nfo.startingGracefulShutdown:warning]: Negotiated failover: starting graceful shutdown. Feb 7 21:47:25 toaster-a-svif2 [toaster-a: kern.shutdown:notice]: System shut down because : "reboot". Feb 7 21:47:25 toaster-a-svif2 [toaster-a: iscsi.service.shutdown:info]: iSCSI service shutdown Feb 7 21:47:25 toaster-a-svif2 [toaster-a: fcp.service.shutdown:info]: FCP service shutdown Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.firmwareStatus:info]: Cluster monitor: partner rebooting Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.nfo.partnerShutdown:warning]: Negotiated failover: partner has shutdown Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.takeover.nfo:info]: Cluster monitor: takeover attempted after 'cf takeover'. command Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER Feb 7 21:47:32 toaster-b-svif2 [toaster-b: cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started Feb 7 21:47:34 toaster-b-svif2 [toaster-b: cf_takeover:info]: NVRAM takeover: partner nvram is disabled Feb 7 21:47:39 toaster-b-svif2 [toaster-a/toaster-b: wafl.takeover.nvram.missing:error]: WAFL takeover: no partner area found during wafl replay Feb 7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b: wafl.replay.done:info]: WAFL log replay completed, 0 seconds Feb 7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b: fcp.service.startup:info]: FCP service startup
1 sys Error custard-1 2010-02-07 21:47:44 61203 ontapdsm DSM ID 0400010d has transitioned to the failed state. 1 sys Error custard-1 2010-02-07 21:47:44 61203 ontapdsm DSM ID 04000107 has transitioned to the failed state. 1 sys Error custard-1 2010-02-07 21:47:44 61203 ontapdsm DSM ID 0400010b has transitioned to the failed state. 1 sys Error custard-1 2010-02-07 21:47:44 61203 ontapdsm DSM ID 0300010b has transitioned to the failed state. [...] 1 sys Error custard-1 2010-02-07 21:47:44 61142 ontapdsm Nexus ID 03000101 has failed.
Feb 7 21:47:48 toaster-b-svif2 [toaster-a/toaster-b: cf.fm.takeoverDetectionSeconds.Default:warning]: option cf.takeover.detection.seconds is set to 10 seconds which is below the NetApp advised value of 15 seconds. False takeovers and/or takeovers without diagnostic core-dumps might occur. Feb 7 21:47:49 toaster-b-svif2 [toaster-a/toaster-b: iscsi.service.startup:info]: iSCSI service startup Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover): cf.rsrc.transitTime:notice]: Top Takeover transit times wafl_restart=7400 {vdisk=7214, restarters=186}, wafl=5049, registry_prerc=613, registry_postrc_phase2=555, rc=553 {ifconfig=99, ifconfig=93, hostname=57, options=41, options=27, options=26, route=2, vif=1, vif=1, vif=1}, raid=525, registry_postrc_phase1=403, wafl_sync=241, raid_replay=221, cifs=216 Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover): cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed Feb 7 21:47:50 toaster-b-svif2 [toaster-b (takeover): cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time is 18 seconds Feb 7 21:47:51 toaster-b-svif2 [toaster-a/toaster-b: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44 Feb 7 21:47:51 toaster-a-svif1 [toaster-a/toaster-b: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44
1 sys Warning custard-1 2010-02-07 21:47:57 61051 ontapdsm DSM ID 03000100 failed path verification. 1 sys Warning custard-1 2010-02-07 21:47:57 61051 ontapdsm DSM ID 03000101 failed path verification. 1 sys Warning custard-1 2010-02-07 21:47:57 61051 ontapdsm DSM ID 03000102 failed path verification. 1 sys Warning custard-1 2010-02-07 21:47:57 61077 ontapdsm DSM ID 03000111 has initiated a fail-over. [...] 1 sys Error custard-1 2010-02-07 21:47:57 61124 ontapdsm The port servicing DSM ID 03000111 reported the logical unit did not respond to selection. [...]
Feb 7 21:48:01 toaster-b-svif2 [toaster-b (takeover): monitor.globalStatus.critical:CRITICAL]: This node has taken over toaster-a. Feb 7 21:48:01 toaster-b-svif2 [toaster-a/toaster-b: monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this node. Feb 7 21:48:01 toaster-a-svif1 [toaster-a/toaster-b: monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this node.
1 sys Warning custard-1 2010-02-07 21:48:04 61204 ontapdsm DSM ID 0400010f is in the process of being removed. 1 sys Warning custard-1 2010-02-07 21:48:04 61205 ontapdsm DSM ID 0400010f was removed. 1 sys Warning custard-1 2010-02-07 21:48:04 61205 ontapdsm DSM ID 0400010e was removed. [...] 1 sys Error custard-1 2010-02-07 21:48:04 16 mpio A fail-over on DeviceMPIODisk108 occurred. [...] 1 sys Warning custard-1 2010-02-07 21:48:04 17 mpio DeviceMPIODisk108 is currently in a degraded state. One or more paths have failed though the process is now complete.
Feb 7 21:48:09 toaster-b-svif2 [toaster-a/toaster-b: syslogd:info]: syslogd: restarted Feb 7 21:48:09 toaster-a-svif1 [toaster-a/toaster-b: syslogd:info]: syslogd: restarted Feb 7 21:48:11 toaster-b-svif2 [toaster-a/toaster-b: nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations have completed for the partner server. Feb 7 21:48:11 toaster-a-svif1 [toaster-a/toaster-b: nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations have completed for the partner server.
1 sys Warning custard-1 2010-02-07 21:48:24 129 ql2300 Reset to device DeviceRaidPort1 was issued. 1 sys Warning custard-1 2010-02-07 21:48:28 1123 ClusSvc The node lost communication with cluster node 'custard-2' on network 'Public (Team)'. 1 sys Warning custard-1 2010-02-07 21:48:28 1123 ClusSvc The node lost communication with cluster node 'custard-2' on network 'Private 172.16 (PCI)'. 1 sys Warning custard-1 2010-02-07 21:48:28 1123 ClusSvc The node lost communication with cluster node 'custard-2' on network 'Private 192.168 (Onboard)'. 1 sys Error custard-1 2010-02-07 21:48:43 1209 ClusDisk Cluster service is requesting a bus reset for device DeviceClusDisk0.
Feb 7 21:48:46 toaster-b-svif2 [toaster-b (takeover): scsitarget.ispfct.targetReset:notice]: FCP Target 0b: Target was Reset by the Initiator at Port Id: 0x10000 (WWPN 2100001b3201ba33)
1 sys Error custard-1 2010-02-07 21:48:51 1209 ClusDisk Cluster service is requesting a bus reset for device DeviceClusDisk0. 1 sys Error custard-1 2010-02-07 21:48:53 1118 ClusNet Cluster service was terminated as requested by Node 2. 1 sys Error custard-1 2010-02-07 21:48:53 7031 Service Control Manager The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service. 5 sys Error custard-1 2010-02-07 21:48:54 1036 ClusSvc Cluster disk resource '' did not respond to a SCSI maintenance command. 1 sys Error custard-1 2010-02-07 21:48:54 1215 ClusSvc Cluster Network Name custard is no longer registered with its hosting system. The associated resource name is ''. 1 sys Error custard-1 2010-02-07 21:48:54 1077 ClusSvc The TCP/IP interface for Cluster IP Address '' has failed. 1 sys Warning custard-1 2010-02-07 21:48:56 50 Ntfs {Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere. 1 sys Warning custard-1 2010-02-07 21:48:57 57 Ftdisk The system failed to flush data to the transaction log. Corruption may occur.
_______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
_______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters