Hi all,
I've got some NetApps which are snapvaulting across a large fat network, but I'm seeing a huge lag in my snapvaults. Most of them are stuck in the Quiescing state, with a lag of up to 300 hours now. Sigh...
Is there any way to speed up things? Or to monitor the actual state of the quiescing stage so I can hopefully figure out what the heck is going on here?
Here's a status from the secondary side of things:
Snapvault secondary is ON.
Source:                 marlfs1:/vol/data01/narad
Destination:            sjdr0:/vol/marlfs1_data01/narad
Status:                 Quiescing
Progress:               -
State:                  Snapvaulted
Lag:                    307:12:16
Mirror Timestamp:       Fri Mar 24 14:00:34 PST 2006
Base Snapshot:          sjdr0(0050404441)_marlfs1_data01-base.168
Current Transfer Type:  -
Current Transfer Error: -
Contents:               Transitioning
Last Transfer Type:     Retry
Last Transfer Size:     13012 KB
Last Transfer Duration: 00:06:45
Last Transfer From:     marlfs1:/vol/data01/narad
Here's the status from the source side:
Source:                 marlfs1:/vol/data01/narad
Destination:            sjdr0:/vol/marlfs1_data01/narad
Status:                 Idle
Progress:               -
State:                  Source
Lag:                    145:16:04
Mirror Timestamp:       Fri Mar 31 11:00:02 EST 2006
Base Snapshot:          sv_hourly.2
Current Transfer Type:  -
Current Transfer Error: -
Contents:               -
Last Transfer Type:     -
Last Transfer Size:     13012 KB
Last Transfer Duration: 00:06:45
Last Transfer From:     -
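For keeping an eye on which relationships are stuck, something like the following could be run over saved status output. This is only a rough sketch: the function names and the 24-hour threshold are made up for illustration, and it assumes the field names are exactly the ones shown in the status output above. It does nothing on the filer itself; it just parses text.

```python
import re

# Field names as they appear in the status output quoted above.
FIELDS = [
    "Source", "Destination", "Status", "Progress", "State", "Lag",
    "Mirror Timestamp", "Base Snapshot", "Current Transfer Type",
    "Current Transfer Error", "Contents", "Last Transfer Type",
    "Last Transfer Size", "Last Transfer Duration", "Last Transfer From",
]
FIELD_RE = re.compile(r"\b(" + "|".join(re.escape(f) for f in FIELDS) + r"):\s*")


def parse_status(text):
    """Return one dict per relationship; a new record starts at each 'Source:'."""
    parts = FIELD_RE.split(text)  # [preamble, key, value, key, value, ...]
    records = []
    for key, value in zip(parts[1::2], parts[2::2]):
        if key == "Source":
            records.append({})
        if records:
            records[-1][key] = value.strip()
    return records


def lag_hours(lag):
    """Convert a lag such as '307:12:16' to hours; None if it is not H:MM:SS."""
    m = re.fullmatch(r"(\d+):(\d\d):(\d\d)", lag)
    if m is None:
        return None
    hours, minutes, seconds = map(int, m.groups())
    return hours + minutes / 60 + seconds / 3600


def stuck(records, max_lag_hours=24.0):
    """Relationships that are Quiescing and lagging more than the (hypothetical) threshold."""
    result = []
    for rec in records:
        hours = lag_hours(rec.get("Lag", "-"))
        if rec.get("Status") == "Quiescing" and hours is not None and hours > max_lag_hours:
            result.append(rec)
    return result
```

Feeding it the secondary-side status text and printing the Destination and Lag of each record returned by stuck() would give a quick list of what needs attention.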
Thanks, John
Guys,
Because I'm an idiot, I didn't even bother to provide useful details on my Filers. Basically, we've got an R200 on one coast with ONTAP 7.0.3P3 installed. I've got an FAS960 also running 7.0.3P3 on the other coast. The 960 is snapvaulting a bunch of qtrees to the R200.
So after I sent out my plea for help (and opened a bug with NetApp) I tried doing something silly. I noticed that the snapvault retries were set to 2, the default. This is with:
r200> snapvault modify <dest>
So I decided, since I had nothing to lose, that maybe I should raise the retry limit up, since we have been doing various WAN network tests and that might have broken things. So I did:
r200> snapvault modify -t 10 <dest>
It hung for a while, say 30-60 seconds. Made me wonder if I had hosed the R200 and made it reboot or something. Came back and my status had changed from "Quiescing" to "Idle". Excellent!
So I was then able to do:
r200> snapvault update <dest>
And hey, it started transferring data again. All the other snapvaults still seemed to be stuck, but this one was going again. I gave it somewhere between 5 and 15 minutes and then noticed that <dest> was stuck again, but this time with a Status of "Pending" and an Error message of "too many active transfers at once on the source". Looking at the FAS960, all the other stuck snapvaults were now transferring data to the R200.
Happy!
It's still too early to know if this will actually help me, since none of them have dropped down their lag time, though some are back into Quiescing state. We'll see what happens.
Hmm... some are Idle on the Source, but not on the Destination. So it looks like we've managed to kick them up again. Very good.
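If this needs repeating for a lot of qtrees, the two commands I used could be generated rather than typed. A minimal sketch, assuming you already have the list of destination paths; the function name is made up, and it only builds the command strings (the same "snapvault modify -t" and "snapvault update" shown above), leaving it to you to paste or pipe them to the secondary:

```python
def kick_commands(destinations, tries=10):
    """Build the modify/update command pair for each destination qtree.

    Only produces strings; nothing is sent to a filer. The default of 10
    tries matches the value used in this post, not a recommendation.
    """
    commands = []
    for dest in destinations:
        commands.append(f"snapvault modify -t {tries} {dest}")
        commands.append(f"snapvault update {dest}")
    return commands


if __name__ == "__main__":
    # Example destination taken from the status output earlier in the thread.
    for line in kick_commands(["/vol/marlfs1_data01/narad"]):
        print(line)
```

Given the "too many active transfers at once on the source" error, it is probably wiser to run these a few at a time than all at once.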
John