How long can Windows survive loss of storage? - toasters

24 Jan 2012


      Hi folks,
We have MS SQL Server (a handful of instances, riding on top of MS
Cluster Services) attached via Fibre Channel to a filer.
Whenever we perform a takeover, SQL Server crumps.  The DBAs shut it
down, restart, repair/restore their databases (unhappy times), life
returns to normal.  We've replicated this behavior a handful of times: 
every single time we experience a takeover, whether administratively or
Murphy-induced.
I'd like to think that we can do better than this.
This particular filer services five clients:  the two MS SQL Servers via
Fibre Channel and three Exchange servers via iSCSI.  The iSCSI clients
notice the takeover, but their initiators do the multipath thing and
ride through the event fine.
In the log extract below (heavily edited), we can see the iSCSI and FCP
services going down at 21:47:25.  And coming back on-line at 21:47:40
(FCP) and 21:47:49 (iSCSI).  The iSCSI client 'hamlet' re-establishes
its session at 21:47:51.  The rest of the log extract records the
downward spiral of the Windows server 'custard-1'.
To my way of thinking, iSCSI was unavailable for 24 seconds, FCP for 15
seconds.
(1) Are the iSCSI and FCP services really and truly available once the
'fcp.service.startup' and 'iscsi.service.startup' messages are logged,
or might there be a delay before storage is actually available to clients?
(2) How could we shrink takeover times?
(3) Why does MS Windows over FC have such trouble surviving a 15 second
disruption to storage?
(4) How could we increase MS Windows timeouts?
(5) How long can /your/ MS Windows FCP clients survive an interruption
to storage?
Toaster:  FAS3020 running 7.2.4
SQL Server Clients:  Windows 2008 running SnapDrive & SnapManager for
SQL Server
Exchange Clients:  Windows 2003 running SnapDrive & SnapManager for Exchange
Log extract below
--sk
Stuart Kendrick
FHCRC
Feb  7 21:47:22 toaster-b-svif2 [toaster-b:
cf.misc.operatorTakeover:warning]: Cluster monitor: takeover initiated
by operator
Feb  7 21:47:22 toaster-b-svif2 [toaster-b:
cf.fsm.nfo.acceptTakeoverReq:warning]: Negotiated failover: accepting
takeover request by partner, reason: operator initiated cf takeover.
Asking partner to shutdown gracefully; will takeover in at most 180 seconds.
Feb  7 21:47:25 toaster-a-svif2 [toaster-a:
cf.fsm.nfo.startingGracefulShutdown:warning]: Negotiated failover:
starting graceful shutdown.
Feb  7 21:47:25 toaster-a-svif2 [toaster-a: kern.shutdown:notice]:
System shut down because : "reboot".
Feb  7 21:47:25 toaster-a-svif2 [toaster-a:
iscsi.service.shutdown:info]: iSCSI service shutdown
Feb  7 21:47:25 toaster-a-svif2 [toaster-a: fcp.service.shutdown:info]:
FCP service shutdown
Feb  7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.firmwareStatus:info]:
Cluster monitor: partner rebooting
Feb  7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fsm.nfo.partnerShutdown:warning]: Negotiated failover: partner has
shutdown
Feb  7 21:47:32 toaster-b-svif2 [toaster-b: cf.fsm.takeover.nfo:info]:
Cluster monitor: takeover attempted after 'cf takeover'. command
Feb  7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fsm.stateTransit:warning]: Cluster monitor: UP --> TAKEOVER
Feb  7 21:47:32 toaster-b-svif2 [toaster-b:
cf.fm.takeoverStarted:warning]: Cluster monitor: takeover started
Feb  7 21:47:34 toaster-b-svif2 [toaster-b: cf_takeover:info]: NVRAM
takeover: partner nvram is disabled
Feb  7 21:47:39 toaster-b-svif2 [toaster-a/toaster-b:
wafl.takeover.nvram.missing:error]: WAFL takeover: no partner area found
during wafl replay
Feb  7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b:
wafl.replay.done:info]: WAFL log replay completed, 0 seconds
Feb  7 21:47:40 toaster-b-svif2 [toaster-a/toaster-b:
fcp.service.startup:info]: FCP service startup
1    sys    Error    custard-1    2010-02-07 21:47:44    61203   
ontapdsm        DSM ID 0400010d has transitioned to the failed state.
1    sys    Error    custard-1    2010-02-07 21:47:44    61203   
ontapdsm        DSM ID 04000107 has transitioned to the failed state.
1    sys    Error    custard-1    2010-02-07 21:47:44    61203   
ontapdsm        DSM ID 0400010b has transitioned to the failed state.
1    sys    Error    custard-1    2010-02-07 21:47:44    61203   
ontapdsm        DSM ID 0300010b has transitioned to the failed state.
[...]
1    sys    Error    custard-1    2010-02-07 21:47:44    61142   
ontapdsm        Nexus ID 03000101 has failed.
Feb  7 21:47:48 toaster-b-svif2 [toaster-a/toaster-b:
cf.fm.takeoverDetectionSeconds.Default:warning]: option
cf.takeover.detection.seconds is set to 10 seconds which is below the
NetApp advised value of 15 seconds. False takeovers and/or takeovers
without diagnostic core-dumps might occur.
Feb  7 21:47:49 toaster-b-svif2 [toaster-a/toaster-b:
iscsi.service.startup:info]: iSCSI service startup
Feb  7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.rsrc.transitTime:notice]: Top Takeover transit times
wafl_restart=7400 {vdisk=7214, restarters=186}, wafl=5049,
registry_prerc=613, registry_postrc_phase2=555, rc=553 {ifconfig=99,
ifconfig=93, hostname=57, options=41, options=27, options=26, route=2,
vif=1, vif=1, vif=1}, raid=525, registry_postrc_phase1=403,
wafl_sync=241, raid_replay=221, cifs=216
Feb  7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.fm.takeoverComplete:warning]: Cluster monitor: takeover completed
Feb  7 21:47:50 toaster-b-svif2 [toaster-b (takeover):
cf.fm.takeoverDuration:warning]: Cluster monitor: takeover duration time
is 18 seconds
Feb  7 21:47:51 toaster-b-svif2 [toaster-a/toaster-b:
iscsi.notice:notice]: ISCSI: New session from initiator
iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44
Feb  7 21:47:51 toaster-a-svif1 [toaster-a/toaster-b:
iscsi.notice:notice]: ISCSI: New session from initiator
iqn.1990-04.org.fhcrc:hamlet.fhcrc.org at IP addr 10.111.152.44
1    sys    Warning    custard-1    2010-02-07 21:47:57    61051   
ontapdsm        DSM ID 03000100 failed path verification.
1    sys    Warning    custard-1    2010-02-07 21:47:57    61051   
ontapdsm        DSM ID 03000101 failed path verification.
1    sys    Warning    custard-1    2010-02-07 21:47:57    61051   
ontapdsm        DSM ID 03000102 failed path verification.
1    sys    Warning    custard-1    2010-02-07 21:47:57    61077   
ontapdsm        DSM ID 03000111 has initiated a fail-over.  
[...]
1    sys    Error    custard-1    2010-02-07 21:47:57    61124   
ontapdsm        The port servicing DSM ID 03000111 reported the logical
unit did not respond to selection.
[...]
Feb  7 21:48:01 toaster-b-svif2 [toaster-b (takeover):
monitor.globalStatus.critical:CRITICAL]: This node has taken over
toaster-a.
Feb  7 21:48:01 toaster-b-svif2 [toaster-a/toaster-b:
monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this
node.
Feb  7 21:48:01 toaster-a-svif1 [toaster-a/toaster-b:
monitor.globalStatus.critical:CRITICAL]: toaster-b has taken over this
node.
1    sys    Warning    custard-1    2010-02-07 21:48:04    61204   
ontapdsm        DSM ID 0400010f is in the process of being removed.
1    sys    Warning    custard-1    2010-02-07 21:48:04    61205   
ontapdsm        DSM ID 0400010f was removed.
1    sys    Warning    custard-1    2010-02-07 21:48:04    61205   
ontapdsm        DSM ID 0400010e was removed.
[...]
1    sys    Error    custard-1    2010-02-07 21:48:04    16    mpio   
    A fail-over on DeviceMPIODisk108 occurred.
[...]
1    sys    Warning    custard-1    2010-02-07 21:48:04    17    mpio   
    DeviceMPIODisk108 is currently in a degraded state. One or more
paths have failed  though the process is now complete.
Feb  7 21:48:09 toaster-b-svif2 [toaster-a/toaster-b: syslogd:info]:
syslogd: restarted
Feb  7 21:48:09 toaster-a-svif1 [toaster-a/toaster-b: syslogd:info]:
syslogd: restarted
Feb  7 21:48:11 toaster-b-svif2 [toaster-a/toaster-b:
nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations
have completed for the partner server.
Feb  7 21:48:11 toaster-a-svif1 [toaster-a/toaster-b:
nbt.nbns.registrationComplete:info]: NBT: All CIFS name registrations
have completed for the partner server.
1    sys    Warning    custard-1    2010-02-07 21:48:24    129   
ql2300        Reset to device  DeviceRaidPort1  was issued.
1    sys    Warning    custard-1    2010-02-07 21:48:28    1123   
ClusSvc        The node lost communication with cluster node 'custard-2'
on network 'Public (Team)'.
1    sys    Warning    custard-1    2010-02-07 21:48:28    1123   
ClusSvc        The node lost communication with cluster node 'custard-2'
on network 'Private 172.16 (PCI)'.
1    sys    Warning    custard-1    2010-02-07 21:48:28    1123   
ClusSvc        The node lost communication with cluster node 'custard-2'
on network 'Private 192.168 (Onboard)'.
1    sys    Error    custard-1    2010-02-07 21:48:43    1209   
ClusDisk        Cluster service is requesting a bus reset for device
DeviceClusDisk0.
Feb  7 21:48:46 toaster-b-svif2 [toaster-b (takeover):
scsitarget.ispfct.targetReset:notice]: FCP Target 0b: Target was Reset
by the Initiator at Port Id: 0x10000 (WWPN 2100001b3201ba33)
1    sys    Error    custard-1    2010-02-07 21:48:51    1209   
ClusDisk        Cluster service is requesting a bus reset for device
DeviceClusDisk0.
1    sys    Error    custard-1    2010-02-07 21:48:53    1118   
ClusNet        Cluster service was terminated as requested by Node 2.
1    sys    Error    custard-1    2010-02-07 21:48:53    7031    Service
Control Manager        The Cluster Service service terminated
unexpectedly.  It has done this 1 time(s).  The following corrective
action will be taken in 60000 milliseconds: Restart the service.
5    sys    Error    custard-1    2010-02-07 21:48:54    1036   
ClusSvc        Cluster disk resource '' did not respond to a SCSI
maintenance command.
1    sys    Error    custard-1    2010-02-07 21:48:54    1215   
ClusSvc        Cluster Network Name custard is no longer registered with
its hosting system. The associated resource name is ''.
1    sys    Error    custard-1    2010-02-07 21:48:54    1077   
ClusSvc        The TCP/IP interface for Cluster IP Address '' has failed.
1    sys    Warning    custard-1    2010-02-07 21:48:56    50    Ntfs   
    {Delayed Write Failed} Windows was unable to save all the data for
the file . The data has been lost. This error may be caused by a failure
of your computer hardware or network connection. Please try to save this
file elsewhere.
1    sys    Warning    custard-1    2010-02-07 21:48:57    57   
Ftdisk        The system failed to flush data to the transaction log.
Corruption may occur.