Re: Oddball SnapMirror issue - Status: Pending with restart checkpoint

4 May 2008

      If you did reach a limit on simultaneous replications, there will be a message in your syslog stating such.
  ----- Original Message ----- 
  From: Kenneth Heal 
  To: Raj Patel ; Mike Partyka 
  Cc: tmacmd@gmail.com ; owner-toasters@mathworks.com ; NetApp Toasters List 
  Sent: Sunday, May 04, 2008 4:59 PM
  Subject: RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint
Hi Raj/Mike
Thanks for the tip; I am surprised the limit caused a replication in progress to be killed (presumably for a new replication or for a scheduled replication).  If this turns out to be the root cause it might be worth asking NetApp to write a bug for the lack of a clear error message.  Something like "Snapmirror replication limit exceeded." would make it much less challenging ;-)
cheers, Kenneth
------------------------------------------------------------------------------
...
Date: Mon, 5 May 2008 08:40:08 +1200
From: phigmov@gmail.com
To: mpartyka@acmn.com
Subject: Re: Oddball SnapMirror issue - Status: Pending with restart checkpoint
CC: kheal@hotmail.com; tmacmd@gmail.com; owner-toasters@mathworks.com; toasters@mathworks.com
Bill Holland pointed me to this link which might be of use to you
http://now.netapp.com/NOW/knowledge/docs/ontap/rel724/html/ontap/onlinebk/4m...
In my case I've staggered the mirrors several hours apart so they
shouldn't kick off simultaneously - I was actually reasonably surprised
(I guess I shouldn't have been) that there was a limit at all.
The other thread mentioned running a wafl_iron type command to check
the source - is there anything else on the source that could affect
establishing a new mirror ? Old snaps ? Old mirrors ? Snap schedules
etc ?
Don't suppose anyone has a definitive way of re-establishing a mirror
over a suspect connection (surely if I throttle the bandwidth it
should just take its time to establish a baseline) ?
Cheers,
Raj.
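On the throttling question, a back-of-the-envelope estimate of how long a baseline takes can help decide whether a slow, throttled initialize is practical. The sketch below is only arithmetic: it assumes the "10mb pipe" mentioned in the original post means 10 Mbit/s, and the 100 GiB volume size is a made-up figure for illustration (the thread does not give the size of the stuck volumes). Protocol overhead and competing traffic are ignored unless you lower the efficiency parameter.

```python
def baseline_hours(size_gib, link_mbit_per_s, efficiency=1.0):
    """Estimate hours needed to push size_gib over a link.

    size_gib        -- volume size in GiB (hypothetical input)
    link_mbit_per_s -- link speed in megabits per second
    efficiency      -- fraction of the link actually usable (1.0 = all of it)
    """
    bits = size_gib * 1024**3 * 8
    seconds = bits / (link_mbit_per_s * 1_000_000 * efficiency)
    return seconds / 3600


# Hypothetical 100 GiB volume over an assumed 10 Mbit/s link.
hours = baseline_hours(100, 10)
print(f"about {hours:.1f} hours at full line rate")

# The same volume if a throttle leaves only half the link to SnapMirror.
print(f"about {baseline_hours(100, 10, efficiency=0.5):.1f} hours at half rate")
```

An estimate like this only says how long any firewall or tunnel in the path has to keep a single connection open; it says nothing about why a transfer aborts.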
On Mon, May 5, 2008 at 7:11 AM, Mike Partyka mpartyka@acmn.com wrote:
...
Yeah, I was thinking the same thing, a packet trace but I am waiting for
support to come to the same conclusion. After the upgrade yesterday morning
I decided I was stumped and opened a ticket this morning. They are
currently looking into the problem. Hopefully I'll hear back today sometime
and I will share with the list what the eventual resolution is.
Regards
Mike
From: Kenneth Heal [mailto:kheal@hotmail.com]
Sent: Sunday, May 04, 2008 2:07 PM
To: Mike Partyka; tmacmd@gmail.com; owner-toasters@mathworks.com; Raj
Patel; NetApp Toasters List
Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
checkpoint
Hi Mike,
Thx for the quick reply. That does indeed shoot my theory/hope out the
water. And I am inclined to agree that going lower on the window size is
not likely to help, especially as both your boxes are in the same datacentre
without any nasty firewalls or WAN links in between them. This is also the
window size recommended in the kb for such problems.
At this point I would be inclined to take a packet trace, fire off ASUPs,
open a support case and upload a gzipped copy of the pktt trace. I have to
admit defeat on this one... though I would be keen to know what the eventual
resolution is.
cheers, Kenneth
https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb17202
________________________________
Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
checkpoint
Date: Sun, 4 May 2008 13:56:45 -0500
From: mpartyka@acmn.com
To: kheal@hotmail.com; tmacmd@gmail.com; owner-toasters@mathworks.com;
phigmov@gmail.com; toasters@mathworks.com
After failing to get the initialization going on the 270 and 3050 (running
7.0.5 and 7.0.6 respectively) yesterday morning we upgraded both the filers
(src and dst) to 7.2.4. Immediately afterwards I tried the mirror again, but
no dice: the error occurs around the same place/time in the initialization.
I did miss the following error in the /etc/messages file:
Sat May 3 11:51:23 CDT [worker_thread_98:notice]: snapmirror: Message from
Read Socket : Connection
Sat May 3 11:51:23 CDT [snapmirror.dst.err:error]: SnapMirror destination
transfer from 10.0.10.238data : snapmirror transfer failed to complete.
Sat May 3 11:51:24 CDT [snapmirror.dst.err:error]: SnapMirror destination
transfer from 10.0.10.238data : snapmirror transfer failed to complete.
I understand this might mean the snapmirror.window_size is too large but
it's set to 32768, which is pretty small already. Usually you increase this
value to increase performance but I don't think I want to go much smaller
than this.
From: Kenneth Heal [mailto:kheal@hotmail.com]
Sent: Sunday, May 04, 2008 1:48 PM
To: Mike Partyka; tmacmd@gmail.com; owner-toasters@mathworks.com; Raj
Patel; NetApp Toasters List
Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
checkpoint
Hi all
I don't see a bug which is a precise match to this, but I do see that both
scenarios were using 7.0.x releases, and I see a fair few SnapMirror bugs
have been fixed in 7.2.4; so I am wondering if in either of the scenarios it
is possible to move both filers to 7.2.4 (I semi-fear it isn't especially
for the source filers concerned) and/or if anyone has seen this on a 7.2.x
release.
cheers
Kenneth
http://now.netapp.com/NOW/cgi-bin/relcmp.on?&rrel=7.0.6&rrel=7.2.4&a...
________________________________
...
Subject: RE: Oddball SnapMirror issue
Date: Sun, 4 May 2008 13:24:05 -0500
From: mpartyka@acmn.com
To: tmacmd@gmail.com; owner-toasters@mathworks.com; phigmov@gmail.com;
toasters@mathworks.com
...
Is there any reason to prefer wafliron over WAFL_check? Sounds like they
do the same thing but you have the option to only check not
automatically fix with WAFL_check.
-Mike
-----Original Message-----
From: tmacmd@gmail.com [mailto:tmacmd@gmail.com]
Sent: Sunday, May 04, 2008 12:59 PM
To: Mike Partyka; owner-toasters@mathworks.com; Raj Patel; NetApp
Toasters List
Subject: Re: Oddball SnapMirror issue
I would try a wafl iron on the source volume/aggr
Just because you do not see any filesystem problems does not mean there
are not any.
--tmac
Sent from my Verizon Wireless BlackBerry
-----Original Message-----
From: "Mike Partyka" mpartyka@acmn.com
Date: Sun, 4 May 2008 09:28:18
To:"Raj Patel" phigmov@gmail.com, toasters@mathworks.com
Subject: RE: Oddball SnapMirror issue
I'm having a similar experience trying to set up a SnapMirror between a
pair of filers in the same datacenter (Not separated by a firewall). The
source is a 3050 running DOT 7.0.5 and the destination is a 270 running
7.0.6. The volume is a 420G volume serving unstructured CIFS data. When
I start the initialize everything works fine until it gets to about 82
or 83G, then the initialize aborts. The log contains some very
non-specific messages, here is the current snapmirror log:
sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
log Sat May 3 09:15:31 CDT FILER_REBOOTED
sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request
(Initialize)
dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort
(snapmirror transfer failed to complete)
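As a rough cross-check on the log above, the sketch below (Python, not part of the original thread) parses the three dst entries and works out how long the initialize ran before it aborted, plus the implied average rate. Assumptions: the year 2008 is taken from the mail dates because the log lines carry none, the wrapped lines are rejoined by hand, the 82 GB figure is the approximate progress quoted in the message, and the field layout is inferred from these three lines only, not from NetApp documentation.

```python
from datetime import datetime

# Log lines copied from the message, with the wrapped continuation
# lines joined back onto their entries.
LOG = """\
dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
"""


def parse_entry(line, year=2008):
    """Split one log line into (timestamp, source, destination, event)."""
    fields = line.split()
    # Inferred layout: type, weekday, month, day, time, tz, source, destination, event...
    stamp = datetime.strptime(
        f"{fields[2]} {fields[3]} {year} {fields[4]}", "%b %d %Y %H:%M:%S"
    )
    return stamp, fields[6], fields[7], " ".join(fields[8:])


entries = [parse_entry(line) for line in LOG.splitlines()]
start = next(t for t, _, _, event in entries if event == "Start")
abort = next(t for t, _, _, event in entries if event.startswith("Abort"))

elapsed = (abort - start).total_seconds()
transferred_gb = 82  # approximate progress reported in the message
mb_per_s = transferred_gb * 1024 / elapsed

print(f"ran for {elapsed:.0f} s, averaging about {mb_per_s:.1f} MB/s")
```

A steady average rate right up to the abort would suggest the transfer was not limping along before it failed, which is consistent with the "same place/time" observation elsewhere in the thread.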
Just as Raj says, when it fails to initialize the destination volume
is in limbo; you can't online it due to the failed initialize. Here is
the error:
vol online: Volume 'rcv_data' was left in an inconsistent state by an
aborted vol copy or an aborted snapmirror initial (level 0) transfer.
In order to bring it online, you must either destroy and re-create
the volume, or complete an initial snapmirror transfer or vol copy.
I have considered running WAFL_check but WAFL isn't reporting an
inconsistent state so I'm not sure that would be very effective.
Yesterday I upgraded both filers to DOT 7.2.4 and updated all firmware
then retried with the exact same results.
The only thing I can think of doing now is running a packet capture on
the filer while it runs and see what that tells me.
-Mike
-----Original Message-----
From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com]
On Behalf Of Raj Patel
Sent: Sunday, May 04, 2008 1:29 AM
To: George T Chen
Cc: toasters@mathworks.com
Subject: Re: Oddball SnapMirror issue
Hi George,
The working transfers do just update 10 to 20Mb - very small turnover.
Unfortunately the two I need to mirror are from scratch - no baseline
snapshot. The checkpoint restart is occurring during the initialisation
phase. Once the initialisation phase stalls, further updates fail as
the volume is not online (obviously because the init failed).
I tried setting a once-a-day schedule at a particular time so it
wouldn't trip over itself or other snapmirror operations, to no avail.
As other volumes are updating with small updates it made me wonder if
it wasn't the router ipsec tunnel or firewall prematurely closing a
connection for a large baseline transfer.
I'll attach the log & config when I get back into work.
Cheers,
Raj.
On Sun, May 4, 2008 at 4:36 PM, George T Chen gtchen@yahoo-inc.com
wrote:
Since you have one volume already transferring, then there's no network
or firewall issue--any problem at that level would affect all volumes,
not just a few.
A "Pending with restart checkpoint" appears when you abort an ongoing
transfer. Checkpoints occur every ?? megabytes and give Ontap a place
to restart instead of from scratch. It's hard to debug without more
info, but I would start by:

1) doing a snapmirror break on the volume (not just an abort)
2) verifying that there is a common baseline snapshot on both source and
destination
3) restarting with a snapmirror resync command
Depending on step 2, you may be required to go to a snapmirror
initialize.
What do the /etc/log/snapmirror and /etc/messages file say?
-gtchen
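The decision in George's steps 2 and 3 (resync if a common baseline snapshot exists, otherwise fall back to a full initialize) can be sketched as a small check. This is an illustration only: the snapshot names below are invented, the lists are assumed to have been collected by hand from each filer, and matching on name alone is a simplification of whatever SnapMirror itself compares.

```python
def common_snapshots(source, destination):
    """Return the snapshot names present on both sides, in source order."""
    on_destination = set(destination)
    return [name for name in source if name in on_destination]


def next_step(source, destination):
    """Pick the command family suggested in the thread for this situation."""
    if common_snapshots(source, destination):
        return "snapmirror resync"
    return "snapmirror initialize"


# Hypothetical snapshot listings, for illustration only.
src_snaps = ["nightly.0", "mirror_base.2", "mirror_base.1"]
dst_snaps = ["mirror_base.1"]

print(common_snapshots(src_snaps, dst_snaps))
print(next_step(src_snaps, dst_snaps))

# A destination that never finished its baseline has nothing in common.
print(next_step(src_snaps, []))
```

In Raj's case the status output shows no base snapshot on the destination, which corresponds to the second branch: there is nothing to resync from.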
...
-----Original Message-----
From: owner-toasters@mathworks.com
[mailto:owner-toasters@mathworks.com]
...
On Behalf Of Raj Patel
Sent: Saturday, May 03, 2008 2:00 AM
To: toasters@mathworks.com
Subject: Oddball SnapMirror issue
We've got two FAS 270's in different cities. They're connected by a
10mb pipe with routers (running ipsec) & firewalls (checkpoint splat)
separating each datacenter.
The primary san is fine and runs all our prod volumes (7.0.5) which
are mirrored to our secondary san (7.0.6).
Recently I had to recreate the mirror relationship for some volumes as
they'd fallen far out of sync due to some firewall work.
What I am seeing is one volume is syncing fine, one has a small lag
and two are stuck with a status of 'Pending with restart checkpoint'
after I re-initialised the transfer.
snapmirror status -l shows this for one of the two that just don't get
properly initialised
Source: 10.1.45.7:sqlprod01
Destination: adcsan1:sqlprod01_mirror
Status: Pending with restart checkpoint
Progress: 38376 KB
State: Unknown
Lag: -
Mirror Timestamp: -
Base Snapshot: -
Current Transfer Type: Retry
Current Transfer Error: volume is not online; cannot execute operation
Contents: -
Last Transfer Type: -
Last Transfer Size: -
Last Transfer Duration: -
Last Transfer From: -
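When several relationships need comparing, the `snapmirror status -l` output above is easy to turn into a dictionary. The sketch below is a minimal parser written for this thread, not a NetApp tool; it assumes every field is a single "Name: value" line (with the wrapped error line rejoined) and treats "-" as "no value". Only a subset of the fields from the post is included to keep it short.

```python
# Fields copied from the post; the wrapped error line is rejoined.
STATUS = """\
Source: 10.1.45.7:sqlprod01
Destination: adcsan1:sqlprod01_mirror
Status: Pending with restart checkpoint
Progress: 38376 KB
State: Unknown
Lag: -
Base Snapshot: -
Current Transfer Type: Retry
Current Transfer Error: volume is not online; cannot execute operation
"""


def parse_status(text):
    """Map each 'Name: value' line to a dict entry; '-' becomes None."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so '10.1.45.7:sqlprod01' stays intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        record[key.strip()] = None if value == "-" else value
    return record


rec = parse_status(STATUS)
progress_kb = int(rec["Progress"].split()[0])
no_baseline = rec["Base Snapshot"] is None

print(rec["Status"], "|", progress_kb, "KB transferred")
print("baseline snapshot missing:", no_baseline)
```

Here the missing base snapshot together with only about 38 MB of progress shows the initial transfer stalled almost immediately, well before a usable baseline existed.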
Our firewall rules have been relaxed to allow free-flow between these
devices (instead of just the SnapMirror ports) and the routers and
circuit haven't changed at all between it working fine and not working
now. The volume that is mirroring OK seems fine and still syncs fine -
granted the updates are small whereas the three non-working volumes
have to sync quite a lot of data.
I've tried deleting the mirrored volumes, recreating them, setting up
the mirror relationship again (with a variety of scheduling and
bandwidth throttling options) and doing a destination SAN reboot.
What are the best options for troubleshooting this or ensuring a
successful mirror ? Has anyone had issues with dropped or stalled
SnapMirror baseline transfers via an IPSec tunnel or Firewall ?
Thanks in advance,
Raj.
PS As an addendum it looks like it starts a transfer, stalls and from
then on subsequent mirrors fail because it's not online (ie the
initialisation fails ?)
What I don't understand is why it just can't carry on with the
initialisation regardless of the interruption by resuming the mirror
operation ?
