"snapmirrors would have been running during this time (3 minute intervals to a DR site)"
How many snapmirrors are you running at 3 minute intervals? Those can generate a lot of IO and CPs.
A while ago we discovered our snapmirrors were accounting for 50% of our IOPs (tuning the schedule fixed that):
http://www.vmadmin.info/2010/07/vmware-and-netapp-deconstructing.html
We also went through VM alignment on NFS - we were able to quantify the alignment then trend it as we aligned VMs:
http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html
On Jul 26, 2013, at 9:43 PM, Chris Picton chris@picton.nom.za wrote:
DOT version 8.1.2p4
I have done a 'sis stop /myvol'. Would restarting it (sis start -s) lose all existing progress, or can I do a normal sis start and have it continue the scan from where it left off?
I can see my CP fluctuates between 100% (:f) and 0% (-) in about 10 second cycles (while the dedup is running). CPU usage spikes up to 20% intermittently, but stays low.
CPU   NFS CIFS HTTP Total    Net   kB/s   Disk   kB/s  Tape  kB/s Cache Cache   CP CP Disk OTHER  FCP iSCSI   FCP  kB/s iSCSI  kB/s
                              in    out   read  write  read write   age   hit time ty util                     in   out    in   out
4%    280    0    0   303   5226     92   4116  18152     0     0     6  100% 100% :f   3%     5   18     0     1     2     0     0
23%  6139    0    0  6217  27307   1638   4748  23752     0     0    0s  100% 100% :f   6%    72    6     0     1     0     0     0
8%   1729    0    0  1734   7395   1037   3184  19592     0     0    0s   96% 100% :f   4%     0    5     0     1     4     0     0
2%     75    0    0    79    568     89   2100  21972     0     0    0s   99% 100% :f   4%     0    4     0     1     4     0     0
4%    193    0    0   194   2554   1265   2364  22648     0     0    0s   93% 100% :f   4%     0    1     0     1     0     0     0
4%    208    0    0   215   4954   1706   9580  16776     0     0    0s   91%  72%  :   4%     5    2     0     1     0     0     0
2%    211    0    0   226   1422   1015    492      0     0     0    0s   97%   0%  -   2%     0   15     0   274   264     0     0
22%  6686    0    0  6715  29299   1740     84     24     0     0    1s  100%   0%  -   2%     0   25     0   276   265     0     0
When I stop the dedup scan for that volume, the snapmirror errors in my logs go away, so I assume the transfers start up again. My CPU usage climbs, and I still have the CP 100% to 0% cycling, but with a few Z states thrown in that weren't there before.
CPU   NFS CIFS HTTP Total    Net   kB/s   Disk   kB/s  Tape  kB/s Cache Cache   CP CP Disk OTHER  FCP iSCSI   FCP  kB/s iSCSI  kB/s
                              in    out   read  write  read write   age   hit time ty util                     in   out    in   out
10%   307    0    0   308   2900   4825  16875  22066     0     0    0s   98% 100% Zf   8%     0    1     0     1     0     0     0
7%    499    0    0   505  10720   5103   3500  34560     0     0    2s   97% 100% :f   7%     0    6     0     2     4     0     0
14%  3325    0    0  3337  22871   5188    292  22084     0     0    2s   99% 100% :f   3%     4    8     0     0     5     0     0
9%    109    0    0   110    522   1474  10156  34608     0     0    2s   94% 100% Zf   9%     0    1     0     1     0     0     0
12%   157    0    0   160    980   2135  33448  30984     0     0    3s   90%  73%  Z  14%     0    3     0     2     0     0     0
79%   178    0    0   179   2792   5136 176444      0     0     0    0s  100%   0%  -  21%     0    1     0     0     0     0     0
82%   290    0    0   309   5569   5495 155284      0     0     0    0s   99%   0%  -  14%     0   19     0   278   265     0     0
82%  3215    0    0  3241  14575   5853 162609     24     0     0    0s  100%   0%  -  17%     9   13     0     2     1     0     0
88%   130    0    0   130    785   5190 195780      0     0     0    0s  100%   0%  -  20%     0    0     0     0     0     0     0
93%   108    0    0   186    957   5091 202388      8     0     0    0s  100%   0%  -  17%     0   78     0  1564  1454     0     0
94%   185    0    0   188   2551   5058 216120     24     0     0    0s  100%   0%  -  19%     0    3     0     2     0     0     0
21%   519    0    0   519   6321   5028  34816  22612     0     0    0s   99%  90% Zf   7%     0    0     0     0     0     0     0
13%  3273    0    0  3363  11980   5900   5328  16108     0     0    0s   99% 100% :f   4%    89    1     0     1     0     0     0
snapmirrors would have been running during this time (3 minute intervals to a DR site), but no other dedup when I started the process
On 2013/07/27 6:21 AM, Fletcher Cocquyt wrote:
Hi Chris -
What version of DOT?
What does a sysstat -x 1 show (CPU and Disk Util wise)?
sysstat -x 1
CPU   NFS CIFS HTTP Total    Net   kB/s   Disk   kB/s  Tape  kB/s Cache Cache   CP CP Disk OTHER  FCP iSCSI   FCP  kB/s iSCSI  kB/s
                              in    out   read  write  read write   age   hit time ty util                     in   out    in   out
60%  7904    0    0  7904  31008 330735 232872     24     0     0     1   96%   0%  -  54%     0    0     0     0     0     0     0
51%  7609    0    0  7609   4659 316694 264612      0     0     0     1   96%   0%  -  39%     0    0     0     0     0     0     0
51%  7154    0    0  7159   3812 281360 204592      8     0     0     1   95%   0%  -  48%     5    0     0     0     0     0     0
You can run a sis stop <vol> and re-run the sysstat -x 1 to compare the relative CPU and disk utilization.
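One way to make that before/after comparison less eyeball-driven is to average the CPU and Disk util columns from two captures. The following is only a minimal Python sketch: the function name `summarize`, the column indexes, and the assumption that every data row has 23 whitespace-separated fields are mine, based on the sysstat -x 1 layout quoted in this thread. The sample rows are taken from the output posted elsewhere in the thread (two while the dedup scan was running, two after it was stopped).

```python
from statistics import mean

# Zero-based positions in a whitespace-split `sysstat -x 1` data row
# (assumed from the column layout quoted in this thread).
CPU_COL = 0
DISK_UTIL_COL = 15
EXPECTED_FIELDS = 23


def summarize(sysstat_text):
    """Return (mean CPU %, mean disk util %) over the data rows in the text."""
    cpu, util = [], []
    for line in sysstat_text.strip().splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a complete data row.
        if len(fields) != EXPECTED_FIELDS or not fields[CPU_COL].endswith("%"):
            continue
        cpu.append(int(fields[CPU_COL].rstrip("%")))
        util.append(int(fields[DISK_UTIL_COL].rstrip("%")))
    return mean(cpu), mean(util)


# Two rows captured while the dedup scan was running.
DURING_SCAN = """
4% 280 0 0 303 5226 92 4116 18152 0 0 6 100% 100% :f 3% 5 18 0 1 2 0 0
2% 75 0 0 79 568 89 2100 21972 0 0 0s 99% 100% :f 4% 0 4 0 1 4 0 0
"""

# Two rows captured after `sis stop`.
AFTER_STOP = """
79% 178 0 0 179 2792 5136 176444 0 0 0 0s 100% 0% - 21% 0 1 0 0 0 0 0
82% 290 0 0 309 5569 5495 155284 0 0 0 0s 99% 0% - 14% 0 19 0 278 265 0 0
"""

for label, capture in (("during scan", DURING_SCAN), ("after stop", AFTER_STOP)):
    cpu_avg, util_avg = summarize(capture)
    print(f"{label}: CPU {cpu_avg}%, disk util {util_avg}%")
```

A longer capture (a minute or more of one-second samples) would give a steadier average than the handful of rows used here.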
Do you have overlapping snapmirror or sis jobs running? If so, consider staggering their schedules to minimize load.
Fletcher
On Jul 26, 2013, at 9:04 PM, Chris Picton <chris@picton.nom.za> wrote:
Hi all
One of the volumes exported via NFS from my fas3210 didn't have dedup enabled when commissioned. It is 250GB, and hosts ploop backed openvz vms. It is currently using about 210GB, and hourly snapshot size is about 6GB.
When I run sis start -s on this volume, the entire system slows down to a crawl. My snmp monitoring starts timing out, ssh access to the system is hit and miss, taking over a minute to log in, and when logged on, command response is sluggish. I also get the following error in the logs for all snapmirror pairs:
SnapMirror: source transfer from TEST_TESTVOL to xx.yy.zz:TEST_TESTVOL : request denied, previous request still processing.
Fortunately, disk access from clients on this and other volumes is not detrimentally affected, but IO response times do go up by about 100ms.
After running overnight for 11 hours, sis status reports:
Progress: 19333120 KB Scanned
Change Log Usage: 88%
Logical Data: 151 GB/49 TB (0%)
At this rate, it will take about 5 days to finish scanning, leaving me barely able to manage the system effectively while this is happening.
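The "about 5 days" figure can be reproduced from the numbers above. This is a rough Python sketch, assuming the scan rate stays constant, that the scan has to cover roughly the 210 GB of used space, and that "KB" in the sis status output means 1024 bytes; the function name `estimate_scan_days` is just for illustration.

```python
def estimate_scan_days(scanned_kb, elapsed_hours, used_gb):
    """Return (total days, remaining days) for the scan at a constant rate."""
    scanned_gb = scanned_kb / 1024 ** 2          # KB -> GB
    rate_gb_per_hour = scanned_gb / elapsed_hours
    total_days = used_gb / rate_gb_per_hour / 24
    remaining_days = (used_gb - scanned_gb) / rate_gb_per_hour / 24
    return total_days, remaining_days


# Figures from the message: 19333120 KB scanned in 11 hours, ~210 GB used.
total, remaining = estimate_scan_days(19_333_120, 11, 210)
print(f"total: {total:.1f} days, remaining: {remaining:.1f} days")
```

With these inputs the scan has covered a little over 18 GB so far, which works out to roughly 5 days in total if nothing changes.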
Is this normal behaviour - do I just have to wait through it, or can I stop it and correct something before trying again? Also, is the change log filling up towards 100% something to worry about?
Regards Chris
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters