Hi Raj/Mike

Thanks for the tip; I am surprised the limit caused a replication in progress to be killed (presumably for a new replication or for a scheduled replication).  If this turns out to be the root cause, it might be worth asking NetApp to file a bug for the lack of a clear error message.  Something like "SnapMirror replication limit exceeded." would make it much less challenging ;-)

cheers, Kenneth
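
PS For anyone comparing failed attempts: a minimal sketch, in Python, of pulling the Start-to-Abort time out of destination-side /etc/log/snapmirror lines like the ones Mike quoted further down. The field layout is assumed from that excerpt only, the function names are made up for illustration, and the year is not in the log so it is passed in as a guess.

```python
from datetime import datetime

# Sample lines copied from the excerpt quoted in this thread (wrapped lines rejoined).
LOG = """\
dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
"""


def parse_dst_line(line, year=2008):
    """Parse one 'dst' log line; return None for anything else.

    Assumed layout: type, weekday, month, day, time, timezone,
    source, destination, event, optional detail.
    The timezone field is ignored, so only compare lines from one filer.
    """
    parts = line.split()
    if len(parts) < 9 or parts[0] != "dst":
        return None
    stamp = datetime.strptime(
        f"{year} {parts[2]} {parts[3]} {parts[4]}", "%Y %b %d %H:%M:%S"
    )
    return {
        "time": stamp,
        "source": parts[6],
        "destination": parts[7],
        "event": parts[8],
    }


def aborted_transfers(log_text, year=2008):
    """Return (source, destination, seconds) for each Start followed by an Abort."""
    started = {}
    aborted = []
    for line in log_text.splitlines():
        rec = parse_dst_line(line, year)
        if rec is None:
            continue
        key = (rec["source"], rec["destination"])
        if rec["event"] == "Start":
            started[key] = rec["time"]
        elif rec["event"] == "Abort" and key in started:
            elapsed = (rec["time"] - started.pop(key)).total_seconds()
            aborted.append((key[0], key[1], elapsed))
    return aborted


if __name__ == "__main__":
    for src, dst, secs in aborted_transfers(LOG):
        print(f"{src} -> {dst}: aborted {secs:.0f}s after start")
```

On Mike's excerpt this gives 6102 seconds, i.e. the initialize ran for roughly 1h42m before the abort. If repeated attempts die after about the same elapsed time, that might point towards a timer or limit rather than a bad block on the source.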


> Date: Mon, 5 May 2008 08:40:08 +1200
> From: phigmov@gmail.com
> To: mpartyka@acmn.com
> Subject: Re: Oddball SnapMirror issue - Status: Pending with restart checkpoint
> CC: kheal@hotmail.com; tmacmd@gmail.com; owner-toasters@mathworks.com; toasters@mathworks.com
>
> Bill Holland pointed me to this link which might be of use to you
>
> http://now.netapp.com/NOW/knowledge/docs/ontap/rel724/html/ontap/onlinebk/4mirror3.htm
>
> In my case I've staggered the mirrors several hours apart so they
> shouldn't kick off simultaneously - I was actually reasonably surprised
> (I guess I shouldn't have been) that there was a limit at all.
>
> The other thread mentioned running a wafl_iron type command to check
> the source - is there anything else on the source that could affect
> establishing a new mirror ? Old snaps ? Old mirrors ? Snap schedules
> etc ?
>
> Don't suppose anyone has a definitive way of re-establishing a mirror
> over a suspect connection (surely if I throttle the bandwidth it
> should just take its time to establish a baseline) ?
>
> Cheers,
> Raj.
>
> On Mon, May 5, 2008 at 7:11 AM, Mike Partyka <mpartyka@acmn.com> wrote:
> >
> >
> >
> >
> > Yeah, I was thinking the same thing, a packet trace but I am waiting for
> > support to come to the same conclusion. After the upgrade yesterday morning
> > I decided I was stumped and opened a ticket this morning. They are
> > currently looking into the problem. Hopefully I'll hear back today sometime
> > and I will share with the list what the eventual resolution is.
> >
> >
> >
> > Regards
> >
> > Mike
> >
> >
> >
> >
> >
> > From: Kenneth Heal [mailto:kheal@hotmail.com]
> > Sent: Sunday, May 04, 2008 2:07 PM
> >
> >
> > To: Mike Partyka; tmacmd@gmail.com; owner-toasters@mathworks.com; Raj
> > Patel; NetApp Toasters List
> > Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> > checkpoint
> >
> >
> >
> >
> >
> > Hi Mike,
> >
> > Thx for the quick reply. That does indeed shoot my theory/hope out of the
> > water. And I am inclined to agree that going lower on the window size is
> > not likely to help, especially as both your boxes are in the same datacentre
> > without any nasty firewalls or WAN links in between them. This is also the
> > window size recommended in the kb for such problems.
> >
> >
> > At this point I would be inclined to take a packet trace, fire off ASUPs, open a
> > support case and upload a gzipped copy of the pktt trace. I have to admit
> > I'm beaten on this one... though I would be keen to know what the eventual
> > resolution is.
> >
> > cheers, Kenneth
> >
> > https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb17202
> > ________________________________
> >
> >
> > Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> > checkpoint
> > Date: Sun, 4 May 2008 13:56:45 -0500
> > From: mpartyka@acmn.com
> > To: kheal@hotmail.com; tmacmd@gmail.com; owner-toasters@mathworks.com;
> > phigmov@gmail.com; toasters@mathworks.com
> >
> >
> > After failing to get the initialization going on the 270 and 3050 (running
> > 7.0.5 and 7.0.6 respectively) yesterday morning we upgraded both the filers
> > (src and dst) to 7.2.4. Immediately afterwards I tried the mirror again, but no
> > dice: the error occurs around the same place/time in the initialization.
> >
> >
> >
> > I did miss the following error in the /etc/messages file:
> >
> >
> >
> > Sat May 3 11:51:23 CDT [worker_thread_98:notice]: snapmirror: Message from Read Socket : Connection
> >
> > Sat May 3 11:51:23 CDT [snapmirror.dst.err:error]: SnapMirror destination transfer from 10.0.10.238:data : snapmirror transfer failed to complete.
> >
> > Sat May 3 11:51:24 CDT [snapmirror.dst.err:error]: SnapMirror destination transfer from 10.0.10.238:data : snapmirror transfer failed to complete.
> >
> >
> >
> > I understand this might mean the snapmirror.window_size is too large but
> > it's set to 32768 which is pretty small already. Usually you increase this
> > value to increase performance but I don't think I want to go much smaller
> > than this.
> >
> >
> >
> >
> >
> > From: Kenneth Heal [mailto:kheal@hotmail.com]
> > Sent: Sunday, May 04, 2008 1:48 PM
> > To: Mike Partyka; tmacmd@gmail.com; owner-toasters@mathworks.com; Raj
> > Patel; NetApp Toasters List
> > Subject: RE: Oddball SnapMirror issue - Status: Pending with restart
> > checkpoint
> >
> >
> >
> > Hi all
> >
> > I don't see a bug which is a precise match to this, but I do see that both
> > scenarios were using 7.0.x releases, and I see a fair few SnapMirror bugs
> > have been fixed in 7.2.4; so I am wondering if in either of the scenarios it
> > is possible to move both filers to 7.2.4 (I semi-fear it isn't especially
> > for the source filers concerned) and/or if anyone has seen this on a 7.2.x
> > release.
> >
> > cheers
> > Kenneth
> >
> >
> >
> > http://now.netapp.com/NOW/cgi-bin/relcmp.on?&rrel=7.0.6&rrel=7.2.4&what=fix
> > ________________________________
> >
> >
> > > Subject: RE: Oddball SnapMirror issue
> > > Date: Sun, 4 May 2008 13:24:05 -0500
> > > From: mpartyka@acmn.com
> > > To: tmacmd@gmail.com; owner-toasters@mathworks.com; phigmov@gmail.com;
> > toasters@mathworks.com
> > >
> > > Is there any reason to prefer wafliron over WAFL_check? Sounds like they
> > > do the same thing, but with WAFL_check you have the option to only check
> > > rather than automatically fix.
> > >
> > > -Mike
> > >
> > > -----Original Message-----
> > > From: tmacmd@gmail.com [mailto:tmacmd@gmail.com]
> > > Sent: Sunday, May 04, 2008 12:59 PM
> > > To: Mike Partyka; owner-toasters@mathworks.com; Raj Patel; NetApp
> > > Toasters List
> > > Subject: Re: Oddball SnapMirror issue
> > >
> > > I would try a wafl iron on the source volume/aggr
> > >
> > > Just because you do not see any filesystem problems does not mean there
> > > are not any.
> > >
> > > --tmac
> > >
> > > Sent from my Verizon Wireless BlackBerry
> > >
> > > -----Original Message-----
> > > From: "Mike Partyka" <mpartyka@acmn.com>
> > >
> > > Date: Sun, 4 May 2008 09:28:18
> > > To:"Raj Patel" <phigmov@gmail.com>, <toasters@mathworks.com>
> > > Subject: RE: Oddball SnapMirror issue
> > >
> > >
> > > I'm having a similar experience trying to set up a SnapMirror between a
> > > pair of filers in the same datacenter (Not separated by a firewall). The
> > > source is a 3050 running DOT 7.0.5 and the destination is a 270 running
> > > 7.0.6. The volume is a 420G volume serving unstructured CIFS data. When
> > > I start the initialize everything works fine until it gets to about 82
> > > or 83G, then the initialize aborts. The log contains some very
> > > non-specific messages, here is the current snapmirror log:
> > >
> > > sys Sat May 3 09:12:55 CDT SnapMirror_off (shutdown)
> > > log Sat May 3 09:15:31 CDT FILER_REBOOTED
> > > sys Sat May 3 09:15:34 CDT SnapMirror_on (registry)
> > > dst Sat May 3 10:09:36 CDT 10.0.10.238:data hci2:rcv_data Request (Initialize)
> > > dst Sat May 3 10:09:42 CDT 10.0.10.238:data hci2:rcv_data Start
> > > dst Sat May 3 11:51:24 CDT 10.0.10.238:data hci2:rcv_data Abort (snapmirror transfer failed to complete)
> > >
> > > Just as Raj says, when it fails to initialize the destination volume
> > > is in limbo: you can't online it due to the failed initialize. Here is
> > > the error:
> > >
> > > vol online: Volume 'rcv_data' was left in an inconsistent state by an
> > > aborted vol copy or an aborted snapmirror initial (level 0) transfer.
> > > In order to bring it online, you must either destroy and re-create
> > > the volume, or complete an initial snapmirror transfer or vol copy.
> > >
> > > I have considered running WAFL_check but WAFL isn't reporting an
> > > inconsistent state so I'm not sure that would be very effective.
> > > Yesterday I upgraded both filers to DOT 7.2.4 and updated all firmware
> > > then retried with the exact same results.
> > >
> > > The only thing I can think of doing now is running a packet capture on
> > > the filer while it runs and see what that tells me.
> > >
> > > -Mike
> > >
> > > -----Original Message-----
> > > From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com]
> > > On Behalf Of Raj Patel
> > > Sent: Sunday, May 04, 2008 1:29 AM
> > > To: George T Chen
> > > Cc: toasters@mathworks.com
> > > Subject: Re: Oddball SnapMirror issue
> > >
> > > Hi George,
> > >
> > > The working transfers do just update 10 to 20Mb - very small turnover.
> > >
> > > Unfortunately the two I need to mirror are from scratch - no baseline
> > > snapshot. The checkpoint restart is occurring during the initialisation
> > > phase. Once the initialisation phase stalls, further updates fail as
> > > the volume is not online (obviously because the init failed).
> > >
> > > I tried setting a once-a-day schedule at a particular time so it
> > > wouldn't trip over itself or other snapmirror operations to no avail.
> > >
> > > As other volumes are updating with small updates it made me wonder if
> > > it wasn't the router ipsec tunnel or firewall prematurely closing a
> > > connection for a large baseline transfer.
> > >
> > > I'll attach the log & config when I get back into work.
> > >
> > > Cheers,
> > > Raj.
> > >
> > > On Sun, May 4, 2008 at 4:36 PM, George T Chen <gtchen@yahoo-inc.com> wrote:
> > > > Since you have one volume already transferring, then there's no network
> > > > or firewall issue--any problem at that level would affect all volumes,
> > > > not just a few.
> > > >
> > > > A "Pending with restart checkpoint" appears when you abort an ongoing
> > > > transfer. Checkpoints occur every ?? megabytes and give Ontap a place
> > > > to restart instead of from scratch. It's hard to debug without more
> > > > info, but I would start by:
> > > >
> > > > 1) doing a snapmirror break on the volume (not just an abort)
> > > > 2) verify that there is a common baseline snapshot on both source and
> > > > destination
> > > > 3) restart with a snapmirror resync command
> > > >
> > > > Depending on step 2, you may be required to go to a snapmirror
> > > > initialize.
> > > >
> > > > What do the /etc/log/snapmirror and /etc/messages file say?
> > > >
> > > > -gtchen
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: owner-toasters@mathworks.com [mailto:owner-toasters@mathworks.com]
> > > > > On Behalf Of Raj Patel
> > > > > Sent: Saturday, May 03, 2008 2:00 AM
> > > > > To: toasters@mathworks.com
> > > > > Subject: Oddball SnapMirror issue
> > > > >
> > > > > > We've got two FAS 270's in different cities. They're connected by a
> > > > > > 10mb pipe with routers (running ipsec) & firewalls (checkpoint splat)
> > > > > > separating each datacenter.
> > > > >
> > > > > The primary san is fine and runs all our prod volumes (7.0.5) which
> > > > > are mirrored to our secondary san (7.0.6).
> > > > >
> > > > > > Recently I had to recreate the mirror relationship for some volumes as
> > > > > > they'd fallen far out of sync due to some firewall work.
> > > > >
> > > > > > What I am seeing is one volume is syncing fine, one has a small lag
> > > > > > and two are stuck with a status of 'Pending with restart checkpoint'
> > > > > > after I re-initialised the transfer.
> > > > > >
> > > > > > snapmirror status -l shows this for one of the two that just don't get
> > > > > > properly initialised
> > > > >
> > > > > Source: 10.1.45.7:sqlprod01
> > > > > Destination: adcsan1:sqlprod01_mirror
> > > > > Status: Pending with restart checkpoint
> > > > > Progress: 38376 KB
> > > > > State: Unknown
> > > > > Lag: -
> > > > > Mirror Timestamp: -
> > > > > Base Snapshot: -
> > > > > Current Transfer Type: Retry
> > > > > > Current Transfer Error: volume is not online; cannot execute operation
> > > > > Contents: -
> > > > > Last Transfer Type: -
> > > > > Last Transfer Size: -
> > > > > Last Transfer Duration: -
> > > > > Last Transfer From: -
> > > > >
> > > > > > Our firewall rules have been relaxed to allow free-flow between these
> > > > > > devices (instead of just the SnapMirror ports) and the routers and
> > > > > > circuit haven't changed at all between it working fine and not working
> > > > > > now. The volume that is mirroring OK seems fine and still syncs fine -
> > > > > > granted the updates are small whereas the three non-working volumes
> > > > > > have to sync quite a lot of data.
> > > > >
> > > > > > I've tried deleting the mirrored volumes, recreating them, setting up
> > > > > > the mirror relationship again (with a variety of scheduling and
> > > > > > bandwidth throttling options) and doing a destination SAN reboot.
> > > > >
> > > > > > What are the best options to troubleshoot this or ensure a
> > > > > > successful mirror ? Has anyone had issues with dropped or stalled
> > > > > > SnapMirror baseline transfers via an IPSec tunnel or Firewall ?
> > > > >
> > > > > Thanks in advance,
> > > > > Raj.
> > > > >
> > > > > > PS As an addendum it looks like it starts a transfer, stalls and from
> > > > > > then on subsequent mirrors fail because it's not online (ie the
> > > > > > initialisation fails ?)
> > > > > >
> > > > > > What I don't understand is why it just can't carry on with the
> > > > > > initialisation regardless of the interruption by resuming the mirror
> > > > > > operation ?
> > > >
> > >
> > >
> > ________________________________
> >
> >
> > Express yourself instantly with MSN Messenger! MSN Messenger
> >