----- Original Message -----
Sent: Sunday, May 04, 2008 4:59 PM
Subject: RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint
Hi Raj/Mike
Thanks for the tip; I am surprised the limit caused a replication in
progress to be killed (presumably for a new replication or for a scheduled
replication). If this turns out to be the root cause, it might be worth
asking NetApp to file a bug for the lack of a clear error message.
Something like "SnapMirror replication limit exceeded." would make it much
less challenging ;-)
cheers,
Kenneth
> Date: Mon, 5 May 2008 08:40:08 +1200
> From: phigmov@gmail.com
> To: mpartyka@acmn.com
> Subject: Re: Oddball SnapMirror issue - Status: Pending with restart checkpoint
> CC: kheal@hotmail.com; tmacmd@gmail.com; owner-toasters@mathworks.com;
>     toasters@mathworks.com
>
> Bill Holland pointed me to this link which might be of use to you
>
> http://now.netapp.com/NOW/knowledge/docs/ontap/rel724/html/ontap/onlinebk/4mirror3.htm
>
> In my case I've staggered the mirrors several hours apart so they
> shouldn't kick off simultaneously - I was actually reasonably surprised
> (I guess I shouldn't have been) that there was a limit at all.
>
> The other thread mentioned running a wafl_iron type command to check
> the source - is there anything else on the source that could affect
> establishing a new mirror? Old snaps? Old mirrors? Snap schedules
> etc?
>
> Don't suppose anyone has a definitive way of re-establishing a mirror
> over a suspect connection (surely if I throttle the bandwidth it
> should just take its time to establish a baseline)?
>
> Cheers,
> Raj.
>
> On Mon, May 5, 2008 at 7:11 AM, Mike Partyka <mpartyka@acmn.com> wrote:
> >
> > Yeah, I was thinking the same thing, a packet trace, but I am waiting
> > for support to come to the same conclusion. After the upgrade yesterday
> > morning I decided I was stumped and opened a ticket this morning. They
> > are currently looking into the problem. Hopefully I'll hear back today
> > sometime and I will share with the list what the eventual resolution is.
> >
> > Regards
> >
> > Mike
> >
> >
> > From: Kenneth Heal [mailto:kheal@hotmail.com]
> > Sent: Sunday, May 04, 2008 2:07 PM
> > To: Mike Partyka; tmacmd@gmail.com; owner-toasters@mathworks.com;
> >     Raj Patel; NetApp Toasters List
> > Subject: RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint
> >
> > Hi Mike,
> >
> > Thx for the quick reply. That does indeed shoot my theory/hope out of
> > the water. And I am inclined to agree that going lower on the window
> > size is not likely to help, especially as both your boxes are in the
> > same datacentre without any nasty firewalls or WAN links in between
> > them. This is also the window size recommended in the kb for such
> > problems.
> >
> > At this point I would be inclined to take a packet trace, fire off
> > ASUPs, open a support case and upload a gzipped copy of the pktt trace.
> > I have to admit I'm beaten on this one... though I would be keen to know
> > what the eventual resolution is.
> >
> > cheers, Kenneth
> >
> > https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb17202
> >
> > ________________________________
> >
> > Subject: RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint
> > Date: Sun, 4 May 2008 13:56:45 -0500
> > From: mpartyka@acmn.com
> > To: kheal@hotmail.com; tmacmd@gmail.com; owner-toasters@mathworks.com;
> >     phigmov@gmail.com; toasters@mathworks.com
> >
> > After failing to get the initialization going on the 270 and 3050
> > (running 7.0.5 and 7.0.6 respectively) yesterday morning we upgraded
> > both the filers (src and dst) to 7.2.4. Immediately afterwards I tried
> > the mirror again, but no dice: the error occurs around the same
> > place/time in the initialization.
> >
> > I did miss the following error in the /etc/messages file:
> >
> > Sat May 3 11:51:23 CDT [worker_thread_98:notice]: snapmirror: Message from Read Socket : Connection
> > Sat May 3 11:51:23 CDT [snapmirror.dst.err:error]: SnapMirror destination transfer from 10.0.10.238data : snapmirror transfer failed to complete.
> > Sat May 3 11:51:24 CDT [snapmirror.dst.err:error]: SnapMirror destination transfer from 10.0.10.238data : snapmirror transfer failed to complete.
> >
> > I understand this might mean the snapmirror.window_size is too large
> > but it's set to 32768 which is pretty small already. Usually you
> > increase this value to increase performance but I don't think I want to
> > go much smaller than this.
> >
> > From: Kenneth Heal [mailto:kheal@hotmail.com]
> > Sent: Sunday, May 04, 2008 1:48 PM
> > To: Mike Partyka; tmacmd@gmail.com; owner-toasters@mathworks.com;
> >     Raj Patel; NetApp Toasters List
> > Subject: RE: Oddball SnapMirror issue - Status: Pending with restart checkpoint
> >
> > Hi all
> >
> > I don't see a bug which is a precise match to this, but I do see that
> > both scenarios were using 7.0.x releases, and I see a fair few
> > SnapMirror bugs have been fixed in 7.2.4; so I am wondering if in either
> > of the scenarios it is possible to move both filers to 7.2.4 (I
> > semi-fear it isn't, especially for the source filers concerned) and/or
> > if anyone has seen this on a 7.2.x release.
> >
> > cheers
> >
> > Kenneth
> >
> > http://now.netapp.com/NOW/cgi-bin/relcmp.on?&rrel=7.0.6&rrel=7.2.4&what=fix
> >
> > ________________________________
> > >
> > > Subject: RE: Oddball SnapMirror issue
> > > Date: Sun, 4 May 2008 13:24:05 -0500
> > > From: mpartyka@acmn.com
> > > To: tmacmd@gmail.com; owner-toasters@mathworks.com; phigmov@gmail.com;
> > >     toasters@mathworks.com
> > >
> > > Is there any reason to prefer wafliron over WAFL_check? Sounds like
> > > they do the same thing but you have the option to only check, not
> > > automatically fix, with WAFL_check.
> > >
> > > -Mike
> > >
> > > -----Original Message-----
> > > From: tmacmd@gmail.com [mailto:tmacmd@gmail.com]
> > > Sent: Sunday, May 04, 2008 12:59 PM
> > > To: Mike Partyka; owner-toasters@mathworks.com; Raj Patel; NetApp
> > >     Toasters List
> > > Subject: Re: Oddball SnapMirror issue
> > >
> > > I would try a wafliron on the source volume/aggr
> > >
> > > Just because you do not see any filesystem problems does not mean
> > > there are not any.
> > >
> > > --tmac
> > >
> > > Sent from my Verizon Wireless BlackBerry
> > >
> > > -----Original Message-----
> > > From: "Mike Partyka" <mpartyka@acmn.com>
> > > Date: Sun, 4 May 2008 09:28:18
> > > To: "Raj Patel" <phigmov@gmail.com>, <toasters@mathworks.com>
> > > Subject: RE: Oddball SnapMirror issue
> > >
> > > I'm having a similar experience trying to set up a SnapMirror between a
> > > pair of filers in the same datacenter (not separated by a firewall). The
> > > source is a 3050 running DOT 7.0.5 and the destination is a 270 running
> > > 7.0.6. The volume is a 420G volume serving unstructured CIFS data. When
> > > I start the initialize everything works fine until it gets to about 82
> > > or 83G, then the initialize aborts. The log contains some very
> > > non-specific messages; here is the current snapmirror log:
> > >
> > > sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
> > > log Sat May 3 09:15:31 CDT FILER_REBOOTED
> > > sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
> > > dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
> > > dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
> > > dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
> > >
> > > Just as Raj says, when it fails to initialize, the destination volume
> > > is in limbo; you can't online it due to the failed initialize. Here is
> > > the error:
> > >
> > > vol online: Volume 'rcv_data' was left in an inconsistent state by an
> > > aborted vol copy or an aborted snapmirror initial (level 0) transfer.
> > > In order to bring it online, you must either destroy and re-create
> > > the volume, or complete an initial snapmirror transfer or vol copy.
> > >
> > > I have considered running WAFL_check but WAFL isn't reporting an
> > > inconsistent state so I'm not sure that would be very effective.
> > > Yesterday I upgraded both filers to DOT 7.2.4 and updated all firmware
> > > then retried with the exact same results.
> > >
> > > The only thing I can think of doing now is running a packet capture on
> > > the filer while it runs and seeing what that tells me.
> > >
> > > -Mike
> > >
> > > -----Original Message-----
> > > From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com]
> > > On Behalf Of Raj Patel
> > > Sent: Sunday, May 04, 2008 1:29 AM
> > > To: George T Chen
> > > Cc: toasters@mathworks.com
> > > Subject: Re: Oddball SnapMirror issue
> > >
> > > Hi George,
> > >
> > > The working transfers do just update 10 to 20Mb - very small turnover.
> > >
> > > Unfortunately the two I need to mirror are from scratch - no baseline
> > > snapshot. The checkpoint restart is occurring during the initialisation
> > > phase. Once the initialisation phase stalls, further updates fail as
> > > the volume is not online (obviously because the init failed).
> > >
> > > I tried setting a once-a-day schedule at a particular time so it
> > > wouldn't trip over itself or other snapmirror operations, to no avail.
> > >
> > > As other volumes are updating with small updates it made me wonder if
> > > it wasn't the router ipsec tunnel or firewall prematurely closing a
> > > connection for a large baseline transfer.
> > >
> > > I'll attach the log & config when I get back into work.
> > >
> > > Cheers,
> > > Raj.
> > >
> > > On Sun, May 4, 2008 at 4:36 PM, George T Chen <gtchen@yahoo-inc.com> wrote:
> > > > Since you have one volume already transferring, then there's no
> > > > network or firewall issue--any problem at that level would affect
> > > > all volumes, not just a few.
> > > >
> > > > A "Pending with restart checkpoint" appears when you abort an ongoing
> > > > transfer. Checkpoints occur every ?? megabytes and give Ontap a place
> > > > to restart instead of from scratch. It's hard to debug without more
> > > > info, but I would start by:
> > > >
> > > > 1) doing a snapmirror break on the volume (not just an abort)
> > > > 2) verifying that there is a common baseline snapshot on both source
> > > >    and destination
> > > > 3) restarting with a snapmirror resync command
> > > >
> > > > Depending on step 2, you may be required to go to a snapmirror
> > > > initialize.
> > > >
> > > > What do the /etc/log/snapmirror and /etc/messages files say?
> > > >
> > > > -gtchen
> > > >
> > > > > -----Original Message-----
> > > > > From: owner-toasters@mathworks.com
> > > > > [mailto:owner-toasters@mathworks.com]
> > > > > On Behalf Of Raj Patel
> > > > > Sent: Saturday, May 03, 2008 2:00 AM
> > > > > To: toasters@mathworks.com
> > > > > Subject: Oddball SnapMirror issue
> > > > >
> > > > > We've got two FAS 270's in different cities. They're connected by a
> > > > > 10mb pipe with routers (running ipsec) & firewalls (checkpoint splat)
> > > > > separating each datacenter.
> > > > >
> > > > > The primary san is fine and runs all our prod volumes (7.0.5) which
> > > > > are mirrored to our secondary san (7.0.6).
> > > > >
> > > > > Recently I had to recreate the mirror relationship for some volumes as
> > > > > they'd fallen far out of sync due to some firewall work.
> > > > >
> > > > > What I am seeing is one volume is syncing fine, one has a small lag
> > > > > and two are stuck with a status of 'Pending with restart checkpoint'
> > > > > after I re-initialised the transfer.
> > > > >
> > > > > snapmirror status -l shows this for one of the two that just don't get
> > > > > properly initialised
> > > > >
> > > > > Source: 10.1.45.7:sqlprod01
> > > > > Destination: adcsan1:sqlprod01_mirror
> > > > > Status: Pending with restart checkpoint
> > > > > Progress: 38376 KB
> > > > > State: Unknown
> > > > > Lag: -
> > > > > Mirror Timestamp: -
> > > > > Base Snapshot: -
> > > > > Current Transfer Type: Retry
> > > > > Current Transfer Error: volume is not online; cannot execute operation
> > > > > Contents: -
> > > > > Last Transfer Type: -
> > > > > Last Transfer Size: -
> > > > > Last Transfer Duration: -
> > > > > Last Transfer From: -
> > > > >
> > > > > Our firewall rules have been relaxed to allow free-flow between these
> > > > > devices (instead of just the SnapMirror ports) and the routers and
> > > > > circuit haven't changed at all between it working fine and not working
> > > > > now. The volume that is mirroring OK seems fine and still syncs fine -
> > > > > granted the updates are small whereas the three non-working volumes
> > > > > have to sync quite a lot of data.
> > > > >
> > > > > I've tried deleting the mirrored volumes, recreating them, setting up
> > > > > the mirror relationship again (with a variety of scheduling and
> > > > > bandwidth throttling options) and doing a destination SAN reboot.
> > > > >
> > > > > What are the best options for troubleshooting this or ensuring a
> > > > > successful mirror? Has anyone had issues with dropped or stalled
> > > > > SnapMirror baseline transfers via an IPSec tunnel or firewall?
> > > > >
> > > > > Thanks in advance,
> > > > > Raj.
> > > > >
> > > > > PS As an addendum it looks like it starts a transfer, stalls and from
> > > > > then on subsequent mirrors fail because it's not online (ie the
> > > > > initialisation fails?)
> > > > >
> > > > > What I don't understand is why it just can't carry on with the
> > > > > initialisation regardless of the interruption by resuming the mirror
> > > > > operation?
> > > > >
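
For anyone who wants to spot stuck relationships without reading the full
listing by eye, here is a minimal sketch. It assumes the "Field: value"
layout of the `snapmirror status -l` output quoted in the thread; the
function names `parse_status` and `stuck` are hypothetical helpers, not part
of any NetApp tool, and the sample text is abridged from Raj's original post.

```python
# Abridged from the `snapmirror status -l` output quoted in the thread.
SAMPLE = """\
Source: 10.1.45.7:sqlprod01
Destination: adcsan1:sqlprod01_mirror
Status: Pending with restart checkpoint
Progress: 38376 KB
State: Unknown
Lag: -
Current Transfer Type: Retry
Current Transfer Error: volume is not online; cannot execute operation
"""


def parse_status(text):
    """Split 'Field: value' output into one dict per relationship.

    Assumption: every relationship starts with a 'Source:' line, as in
    the listing quoted in the thread.
    """
    records, current = [], {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # '10.1.45.7:sqlprod01' keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines or banner lines without a colon
        key, value = key.strip(), value.strip()
        if key == "Source" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def stuck(records):
    """Return relationships whose Status starts with 'Pending'."""
    return [r for r in records if r.get("Status", "").startswith("Pending")]


if __name__ == "__main__":
    for rel in stuck(parse_status(SAMPLE)):
        print(rel["Destination"], "-", rel.get("Current Transfer Error", "-"))
```

Feeding it the captured output of the command from each destination filer
would list only the relationships in the pending state, along with the
error each one reports.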