"snapmirrors would have been running during this time (3 minute intervals to a DR site)"

How many snapmirrors are you running at 3-minute intervals? Those can generate a lot of IO and CPs (consistency points).

A while ago we discovered our snapmirrors were accounting for 50% of our IOPS (tuning the schedule fixed that).

We also went through VM alignment on NFS. We were able to quantify the alignment, then trend it as we aligned VMs.

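For anyone wanting to reproduce that kind of alignment check: a guest partition is aligned when its starting byte offset is a multiple of the 4 KB WAFL block size. A minimal Python sketch, assuming 512-byte sectors and that you already have each partition's starting sector from inside the guest. `is_aligned` and `misaligned_fraction` are hypothetical helpers, not NetApp tools, and the fraction is just one possible number to trend:

```python
WAFL_BLOCK_BYTES = 4096  # WAFL stores data in 4 KB blocks


def is_aligned(start_sector: int, sector_bytes: int = 512) -> bool:
    """True if a partition starting at start_sector begins on a 4 KB boundary.

    Assumes 512-byte sectors unless told otherwise.
    """
    return (start_sector * sector_bytes) % WAFL_BLOCK_BYTES == 0


def misaligned_fraction(start_sectors: list[int]) -> float:
    """Hypothetical trend metric: share of partitions that are misaligned."""
    if not start_sectors:
        return 0.0
    bad = sum(1 for s in start_sectors if not is_aligned(s))
    return bad / len(start_sectors)


if __name__ == "__main__":
    # Sector 63 was the classic MBR default (misaligned); 2048 lands on a 4 KB boundary.
    print(is_aligned(63), is_aligned(2048))  # False True
    print(misaligned_fraction([63, 63, 2048, 2048]))  # 0.5
```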

On Jul 26, 2013, at 9:43 PM, Chris Picton <chris@picton.nom.za> wrote:

DOT version 8.1.2p4

I have run 'sis stop /myvol'. Would restarting it (sis start -s) lose all existing progress, or can I run a plain sis start and have it continue the scan from where it left off?

While the dedup is running, I can see CP time fluctuate between 100% (:f) and 0% (-) in roughly 10-second cycles. CPU usage spikes up to about 20% intermittently but otherwise stays low.

 CPU   NFS  CIFS  HTTP  Total        Net kB/s      Disk kB/s   Tape kB/s Cache Cache    CP  CP  Disk OTHER   FCP iSCSI      FCP kB/s    iSCSI kB/s
                                   in     out    read  write  read write   age   hit  time  ty  util                       in    out     in    out
  4%   280     0     0    303    5226      92    4116  18152     0     0     6  100%  100%  :f    3%     5    18     0      1      2      0      0
 23%  6139     0     0   6217   27307    1638    4748  23752     0     0    0s  100%  100%  :f    6%    72     6     0      1      0      0      0
  8%  1729     0     0   1734    7395    1037    3184  19592     0     0    0s   96%  100%  :f    4%     0     5     0      1      4      0      0
  2%    75     0     0     79     568      89    2100  21972     0     0    0s   99%  100%  :f    4%     0     4     0      1      4      0      0
  4%   193     0     0    194    2554    1265    2364  22648     0     0    0s   93%  100%  :f    4%     0     1     0      1      0      0      0
  4%   208     0     0    215    4954    1706    9580  16776     0     0    0s   91%   72%   :    4%     5     2     0      1      0      0      0
  2%   211     0     0    226    1422    1015     492      0     0     0    0s   97%    0%   -    2%     0    15     0    274    264      0      0
 22%  6686     0     0   6715   29299    1740      84     24     0     0    1s  100%    0%   -    2%     0    25     0    276    265      0      0
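To put numbers on the CP cycling, rows like the ones above can be summarised with a short script. A minimal Python sketch, assuming the 23-column `sysstat -x` layout shown here (CPU first, CP time as the 14th field, CP type as the 15th, Disk util as the 16th). The function names are hypothetical, and the script only counts what sysstat reports; it does not interpret the CP type codes:

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cpu: int        # CPU column, percent
    cp_time: int    # CP time column: percent of the interval spent in a CP
    cp_type: str    # CP ty column, kept verbatim (e.g. ":f", "Zf", "-")
    disk_util: int  # Disk util column, percent


def pct(field: str) -> int:
    return int(field.rstrip("%"))


def parse_sysstat_x(text: str) -> list[Sample]:
    """Parse data rows of `sysstat -x` output; header lines are skipped."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: data rows have 23 fields and start with a CPU percentage.
        if len(fields) != 23 or not fields[0].endswith("%"):
            continue
        samples.append(Sample(pct(fields[0]), pct(fields[13]), fields[14], pct(fields[15])))
    return samples


def summarise(samples: list[Sample]) -> dict:
    if not samples:
        raise ValueError("no sysstat data rows found")
    return {
        "intervals": len(samples),
        "mean_cpu": sum(s.cpu for s in samples) / len(samples),
        "max_cpu": max(s.cpu for s in samples),
        "full_cp_intervals": sum(1 for s in samples if s.cp_time == 100),
        "idle_cp_intervals": sum(1 for s in samples if s.cp_time == 0),
    }


# Rows copied from the dedup-running capture in this thread (headers left in on purpose).
SAMPLE = """\
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
in out read write read write age hit time ty util in out in out
4% 280 0 0 303 5226 92 4116 18152 0 0 6 100% 100% :f 3% 5 18 0 1 2 0 0
23% 6139 0 0 6217 27307 1638 4748 23752 0 0 0s 100% 100% :f 6% 72 6 0 1 0 0 0
8% 1729 0 0 1734 7395 1037 3184 19592 0 0 0s 96% 100% :f 4% 0 5 0 1 4 0 0
2% 75 0 0 79 568 89 2100 21972 0 0 0s 99% 100% :f 4% 0 4 0 1 4 0 0
4% 193 0 0 194 2554 1265 2364 22648 0 0 0s 93% 100% :f 4% 0 1 0 1 0 0 0
4% 208 0 0 215 4954 1706 9580 16776 0 0 0s 91% 72% : 4% 5 2 0 1 0 0 0
2% 211 0 0 226 1422 1015 492 0 0 0 0s 97% 0% - 2% 0 15 0 274 264 0 0
22% 6686 0 0 6715 29299 1740 84 24 0 0 1s 100% 0% - 2% 0 25 0 276 265 0 0
"""

if __name__ == "__main__":
    print(summarise(parse_sysstat_x(SAMPLE)))
    # {'intervals': 8, 'mean_cpu': 8.625, 'max_cpu': 23, 'full_cp_intervals': 5, 'idle_cp_intervals': 2}
```

Feeding it a longer capture (for example a few minutes of `sysstat -x 1`) gives a steadier picture than eight one-second samples.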

When I stop the dedup scan for that volume, the snapmirror errors in my logs go away, so I assume the transfers start running again. My CPU usage climbs, and I still see CP time cycling between 100% and 0%, but with a few Z states thrown in that weren't there before:

 CPU   NFS  CIFS  HTTP  Total        Net kB/s      Disk kB/s   Tape kB/s Cache Cache    CP  CP  Disk OTHER   FCP iSCSI      FCP kB/s    iSCSI kB/s
                                   in     out    read  write  read write   age   hit  time  ty  util                       in    out     in    out
 10%   307     0     0    308    2900    4825   16875  22066     0     0    0s   98%  100%  Zf    8%     0     1     0      1      0      0      0
  7%   499     0     0    505   10720    5103    3500  34560     0     0    2s   97%  100%  :f    7%     0     6     0      2      4      0      0
 14%  3325     0     0   3337   22871    5188     292  22084     0     0    2s   99%  100%  :f    3%     4     8     0      0      5      0      0
  9%   109     0     0    110     522    1474   10156  34608     0     0    2s   94%  100%  Zf    9%     0     1     0      1      0      0      0
 12%   157     0     0    160     980    2135   33448  30984     0     0    3s   90%   73%   Z   14%     0     3     0      2      0      0      0
 79%   178     0     0    179    2792    5136  176444      0     0     0    0s  100%    0%   -   21%     0     1     0      0      0      0      0
 82%   290     0     0    309    5569    5495  155284      0     0     0    0s   99%    0%   -   14%     0    19     0    278    265      0      0
 82%  3215     0     0   3241   14575    5853  162609     24     0     0    0s  100%    0%   -   17%     9    13     0      2      1      0      0
 88%   130     0     0    130     785    5190  195780      0     0     0    0s  100%    0%   -   20%     0     0     0      0      0      0      0
 93%   108     0     0    186     957    5091  202388      8     0     0    0s  100%    0%   -   17%     0    78     0   1564   1454      0      0
 94%   185     0     0    188    2551    5058  216120     24     0     0    0s  100%    0%   -   19%     0     3     0      2      0      0      0
 21%   519     0     0    519    6321    5028   34816  22612     0     0    0s   99%   90%  Zf    7%     0     0     0      0      0      0      0
 13%  3273     0     0   3363   11980    5900    5328  16108     0     0    0s   99%  100%  :f    4%    89     1     0      1      0      0      0

Snapmirrors would have been running during this time (3 minute intervals to a DR site), but no other dedup jobs were running when I started the process.

On 2013/07/27 6:21 AM, Fletcher Cocquyt wrote:
Hi Chris -

What version of DOT?

What does sysstat -x 1 show (CPU and disk utilization)?

sysstat -x 1
 CPU   NFS  CIFS  HTTP  Total        Net kB/s      Disk kB/s   Tape kB/s Cache Cache    CP  CP  Disk OTHER   FCP iSCSI      FCP kB/s    iSCSI kB/s
                                   in     out    read  write  read write   age   hit  time  ty  util                       in    out     in    out
 60%  7904     0     0   7904   31008  330735  232872     24     0     0     1   96%    0%   -   54%     0     0     0      0      0      0      0
 51%  7609     0     0   7609    4659  316694  264612      0     0     0     1   96%    0%   -   39%     0     0     0      0      0      0      0
 51%  7154     0     0   7159    3812  281360  204592      8     0     0     1   95%    0%   -   48%     5     0     0      0      0      0      0

You can run sis stop <vol> and re-run sysstat -x 1 to compare the relative CPU and disk utilization.
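That comparison boils down to averaging two captures. A minimal Python sketch, assuming you have already pulled the CPU and Disk util columns out of each capture; the values below are copied from Chris's two runs earlier in this thread (dedup running, then dedup stopped with snapmirror resuming), and `compare` is a hypothetical helper:

```python
from statistics import mean

# CPU and Disk util columns (percent) copied from the two captures in this thread.
DEDUP_RUNNING = {"cpu": [4, 23, 8, 2, 4, 4, 2, 22], "util": [3, 6, 4, 4, 4, 4, 2, 2]}
DEDUP_STOPPED = {
    "cpu": [10, 7, 14, 9, 12, 79, 82, 82, 88, 93, 94, 21, 13],
    "util": [8, 7, 3, 9, 14, 21, 14, 17, 20, 17, 19, 7, 4],
}


def compare(before: dict, after: dict) -> tuple[float, float]:
    """Return (change in mean CPU, change in mean disk util) in percentage points."""
    d_cpu = mean(after["cpu"]) - mean(before["cpu"])
    d_util = mean(after["util"]) - mean(before["util"])
    return d_cpu, d_util


if __name__ == "__main__":
    d_cpu, d_util = compare(DEDUP_RUNNING, DEDUP_STOPPED)
    print(f"mean CPU change: {d_cpu:+.1f} points, mean disk util change: {d_util:+.1f} points")
    # mean CPU change: +37.8 points, mean disk util change: +8.7 points
```

In this thread the "dedup stopped" capture is the busier one, because the queued snapmirror transfers start as soon as the scan stops, so a fair comparison would also need snapmirror quiet (or at least noted) in both captures.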

Do you have overlapping snapmirror or sis jobs running?
If so, consider staggering their schedules to minimize load.
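One way to stagger several relationships that share a short interval is to give each its own minute offset within that interval. A minimal Python sketch that builds the minute lists; it assumes the four-field `minute hour day-of-month day-of-week` schedule column of a 7-mode /etc/snapmirror.conf entry (check the snapmirror.conf documentation for your release), and `staggered_schedules` and the volume names are hypothetical:

```python
def staggered_schedules(pairs: list[str], interval_min: int) -> dict[str, str]:
    """Give each relationship its own minute offset within the interval.

    Returns an assumed 'minute hour day-of-month day-of-week' schedule string per pair.
    Once there are more pairs than minutes in the interval, offsets wrap and repeat.
    """
    if not 1 <= interval_min <= 60 or 60 % interval_min:
        raise ValueError("interval must divide 60 evenly")
    schedules = {}
    for i, pair in enumerate(pairs):
        offset = i % interval_min
        minutes = ",".join(str(m) for m in range(offset, 60, interval_min))
        schedules[pair] = f"{minutes} * * *"
    return schedules


if __name__ == "__main__":
    # Hypothetical volume names.
    for pair, sched in staggered_schedules(["vol_a", "vol_b", "vol_c"], 3).items():
        print(pair, sched)
```

With a 3-minute interval there are only three distinct offsets, so a larger number of relationships will still overlap; lengthening the interval gives the scheduler more room.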

Fletcher

On Jul 26, 2013, at 9:04 PM, Chris Picton <chris@picton.nom.za> wrote:

Hi all

One of the volumes exported via NFS from my FAS3210 didn't have dedup enabled when it was commissioned. It is 250 GB and hosts ploop-backed OpenVZ VMs. It is currently using about 210 GB, and the hourly snapshot size is about 6 GB.

When I run sis start -s on this volume, the entire system slows to a crawl. My SNMP monitoring starts timing out, SSH access to the system is hit and miss (taking over a minute to log in), and once logged in, command response is sluggish. I also get the following error in the logs for all snapmirror pairs:

SnapMirror: source transfer from TEST_TESTVOL to xx.yy.zz:TEST_TESTVOL : request denied, previous request still processing.
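To see which pairs are being denied and how often, the messages can be counted per relationship. A minimal Python sketch, assuming the message text matches the line above; the pattern is searched rather than anchored, so a timestamp or other prefix in front of it is tolerated. `count_denials` is a hypothetical helper:

```python
import re
from collections import Counter

# Assumed message format, taken from the log line quoted in this thread.
DENIED = re.compile(
    r"SnapMirror: source transfer from (?P<src>\S+) to (?P<dst>\S+) : "
    r"request denied, previous request still processing"
)


def count_denials(lines) -> Counter:
    """Count 'previous request still processing' denials per (source, destination)."""
    counts = Counter()
    for line in lines:
        m = DENIED.search(line)
        if m:
            counts[(m["src"], m["dst"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "SnapMirror: source transfer from TEST_TESTVOL to xx.yy.zz:TEST_TESTVOL : "
        "request denied, previous request still processing.",
    ]
    for (src, dst), n in count_denials(sample).items():
        print(f"{src} -> {dst}: {n} denied")
```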

Fortunately, client disk access on this and other volumes is not badly affected, although IO response times do go up by about 100 ms.

After running overnight for 11 hours, sis status reports:
Progress:                        19333120 KB Scanned
Change Log Usage:                88%
Logical Data:                    151 GB/49 TB (0%)


At this rate, it will take about 5 days to finish scanning, leaving me barely able to manage the system effectively while this is happening.
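The 5-day figure follows from a simple rate calculation. A Python sketch, assuming the scan rate stays constant and that the scan has to cover roughly the 210 GB in use (treating 1 GB as 1024 × 1024 KB); both are approximations, and `sis_eta_hours` is a hypothetical helper:

```python
def sis_eta_hours(scanned_kb: int, elapsed_h: float, total_gb: float) -> tuple[float, float]:
    """Return (total hours, remaining hours) for a scan running at a constant rate."""
    total_kb = total_gb * 1024 * 1024
    rate_kb_per_h = scanned_kb / elapsed_h
    total_h = total_kb / rate_kb_per_h
    return total_h, total_h - elapsed_h


# Figures from the sis status output above: 19333120 KB scanned in 11 hours, ~210 GB used.
total_h, remaining_h = sis_eta_hours(19333120, 11, 210)
print(f"total {total_h:.0f} h ({total_h / 24:.1f} days), remaining {remaining_h:.0f} h")
# total 125 h (5.2 days), remaining 114 h
```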

Is this normal behaviour? Do I just have to wait it out, or can I stop it and correct something before trying again? Also, is the change log filling up towards 100% something to worry about?

Regards
Chris

_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters