Hi all
One of the volumes exported via NFS from my FAS3210 didn't have dedup enabled when commissioned. It is 250GB and hosts ploop-backed OpenVZ VMs. It is currently using about 210GB, and the hourly snapshot size is about 6GB.
When I run sis start -s on this volume, the entire system slows to a crawl. My SNMP monitoring starts timing out, SSH access to the system is hit and miss, taking over a minute to log in, and once logged in, command response is sluggish. I also get the following error in the logs for all snapmirror pairs:
SnapMirror: source transfer from TEST_TESTVOL to xx.yy.zz:TEST_TESTVOL : request denied, previous request still processing.
Fortunately, disk access from clients on this and other volumes is not detrimentally affected, but IO response times do go up by about 100ms.
After running overnight for 11 hours, sis status reports:
Progress: 19333120 KB Scanned
Change Log Usage: 88%
Logical Data: 151 GB/49 TB (0%)
At this rate, it will take about 5 days to finish scanning, leaving me barely able to manage the system effectively while this is happening.
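A back-of-the-envelope check of that projection, assuming the scan has to cover the full ~210GB of used data at the rate observed so far:

```python
# Sanity-check the dedup scan projection from the sis status numbers above,
# assuming the whole ~210GB of used data is scanned at the observed rate.
scanned_kb = 19333120        # "Progress: 19333120 KB Scanned"
hours_elapsed = 11
used_gb = 210

rate_kb_per_hour = scanned_kb / hours_elapsed
total_kb = used_gb * 1024 ** 2
days_to_finish = total_kb / rate_kb_per_hour / 24
print(round(days_to_finish, 1))   # roughly 5.2 days
```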
Is this normal behaviour? Do I just have to wait it out, or can I stop it and correct something before trying again? Also, is the change log filling up towards 100% something to worry about?
Regards Chris
If the alignment is OK on your VMs, maybe do a reallocate measure on the vol. I assume there is space in the aggr and vol. If it's not a pain, maybe move the data out (Storage vMotion), destroy and recreate the vol, and dedupe again. It should not take that long, in my opinion. Oh, stupid question: you're not deduping all the vols at the same time? I would figure not.
----- Original Message ----- From: Chris Picton [mailto:chris@picton.nom.za] Sent: Friday, July 26, 2013 09:04 PM To: toasters@teaparty.net Subject: sis start -s causing system slowdown
_______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
I am only doing the single volume right now.
I have free space in the aggr, but only 15% free in the vol - would stopping the dedup process and growing the vol have any benefit? Also, could reallocate measure be run while the dedup scan is in progress? I could consider migrating the OpenVZ VMs, but I would prefer to do things on the existing volume.
Thanks Chris
On 2013/07/27 6:11 AM, Klise, Steve wrote:
Hi Chris -
What version of DOT?
What does a sysstat -x 1 show (CPU and Disk Util wise)?
sysstat -x 1
 CPU    NFS   CIFS   HTTP   Total    Net kB/s     Disk kB/s    Tape kB/s  Cache  Cache   CP   CP  Disk   OTHER   FCP  iSCSI   FCP kB/s  iSCSI kB/s
                                      in    out    read  write  read write   age    hit  time  ty  util                        in   out    in   out
 60%   7904      0      0    7904  31008 330735  232872     24     0     0     1    96%    0%   -  54%      0     0      0     0     0     0     0
 51%   7609      0      0    7609   4659 316694  264612      0     0     0     1    96%    0%   -  39%      0     0      0     0     0     0     0
 51%   7154      0      0    7159   3812 281360  204592      8     0     0     1    95%    0%   -  48%      5     0      0     0     0     0     0
You can run a sis stop <vol> and re-run the sysstat -x 1 to compare the relative CPU & disk utilization.
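To make that before/after comparison a single number each, pasted sysstat rows can be averaged with a small script. A sketch; the field positions are assumed from the standard `sysstat -x` header, where CPU is the first column and Disk util the sixteenth:

```python
# Average the CPU and Disk util columns from pasted `sysstat -x 1` data rows,
# so comparing captures before and after `sis stop` is a single number each.
# Column positions (1 and 16) are taken from the standard -x header layout.
def avg_util(rows):
    cpu, disk = [], []
    for line in rows:
        fields = line.split()
        cpu.append(int(fields[0].rstrip('%')))    # CPU column
        disk.append(int(fields[15].rstrip('%')))  # Disk util column
    return sum(cpu) / len(cpu), sum(disk) / len(disk)

sample = [
    "60% 7904 0 0 7904 31008 330735 232872 24 0 0 1 96% 0% - 54% 0 0 0 0 0 0 0",
    "51% 7609 0 0 7609 4659 316694 264612 0 0 0 1 96% 0% - 39% 0 0 0 0 0 0 0",
    "51% 7154 0 0 7159 3812 281360 204592 8 0 0 1 95% 0% - 48% 5 0 0 0 0 0 0",
]
print(avg_util(sample))   # (54.0, 47.0)
```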
Do you have overlapping snapmirror or sis jobs running? If so, consider staggering their schedules to minimize load.
Fletcher
On Jul 26, 2013, at 9:04 PM, Chris Picton chris@picton.nom.za wrote:
DOT version 8.1.2p4
I have done a 'sis stop /myvol'. Would restarting it with sis start -s lose all existing progress, or can I do a normal sis start and have it continue the scan from where it left off?
I can see my CP fluctuates between 100% (:f) and 0% (-) in roughly 10-second cycles while the dedup is running. CPU usage spikes up to 20% intermittently, but stays low.
 CPU    NFS   CIFS   HTTP   Total    Net kB/s     Disk kB/s    Tape kB/s  Cache  Cache   CP   CP  Disk   OTHER   FCP  iSCSI   FCP kB/s  iSCSI kB/s
                                      in    out    read  write  read write   age    hit  time  ty  util                        in   out    in   out
  4%    280      0      0     303   5226     92    4116  18152     0     0     6   100%  100%  :f   3%      5    18      0     1     2     0     0
 23%   6139      0      0    6217  27307   1638    4748  23752     0     0    0s   100%  100%  :f   6%     72     6      0     1     0     0     0
  8%   1729      0      0    1734   7395   1037    3184  19592     0     0    0s    96%  100%  :f   4%      0     5      0     1     4     0     0
  2%     75      0      0      79    568     89    2100  21972     0     0    0s    99%  100%  :f   4%      0     4      0     1     4     0     0
  4%    193      0      0     194   2554   1265    2364  22648     0     0    0s    93%  100%  :f   4%      0     1      0     1     0     0     0
  4%    208      0      0     215   4954   1706    9580  16776     0     0    0s    91%   72%  :    4%      5     2      0     1     0     0     0
  2%    211      0      0     226   1422   1015     492      0     0     0    0s    97%    0%   -   2%      0    15      0   274   264     0     0
 22%   6686      0      0    6715  29299   1740      84     24     0     0    1s   100%    0%   -   2%      0    25      0   276   265     0     0
When I stop the dedup scan for that volume, the snapmirror errors in my logs go away, so I assume the transfers resume. My CPU usage climbs, and I still have the CP cycling between 100% and 0%, but with a few Z states thrown in that weren't there before:
 CPU    NFS   CIFS   HTTP   Total    Net kB/s     Disk kB/s    Tape kB/s  Cache  Cache   CP   CP  Disk   OTHER   FCP  iSCSI   FCP kB/s  iSCSI kB/s
                                      in    out    read  write  read write   age    hit  time  ty  util                        in   out    in   out
 10%    307      0      0     308   2900   4825   16875  22066     0     0    0s    98%  100%  Zf   8%      0     1      0     1     0     0     0
  7%    499      0      0     505  10720   5103    3500  34560     0     0    2s    97%  100%  :f   7%      0     6      0     2     4     0     0
 14%   3325      0      0    3337  22871   5188     292  22084     0     0    2s    99%  100%  :f   3%      4     8      0     0     5     0     0
  9%    109      0      0     110    522   1474   10156  34608     0     0    2s    94%  100%  Zf   9%      0     1      0     1     0     0     0
 12%    157      0      0     160    980   2135   33448  30984     0     0    3s    90%   73%  Z   14%      0     3      0     2     0     0     0
 79%    178      0      0     179   2792   5136  176444      0     0     0    0s   100%    0%   -  21%      0     1      0     0     0     0     0
 82%    290      0      0     309   5569   5495  155284      0     0     0    0s    99%    0%   -  14%      0    19      0   278   265     0     0
 82%   3215      0      0    3241  14575   5853  162609     24     0     0    0s   100%    0%   -  17%      9    13      0     2     1     0     0
 88%    130      0      0     130    785   5190  195780      0     0     0    0s   100%    0%   -  20%      0     0      0     0     0     0     0
 93%    108      0      0     186    957   5091  202388      8     0     0    0s   100%    0%   -  17%      0    78      0  1564  1454     0     0
 94%    185      0      0     188   2551   5058  216120     24     0     0    0s   100%    0%   -  19%      0     3      0     2     0     0     0
 21%    519      0      0     519   6321   5028   34816  22612     0     0    0s    99%   90%  Zf   7%      0     0      0     0     0     0     0
 13%   3273      0      0    3363  11980   5900    5328  16108     0     0    0s    99%  100%  :f   4%     89     1      0     1     0     0     0
Snapmirrors would have been running during this time (3-minute intervals to a DR site), but no other dedup was running when I started the process.
On 2013/07/27 6:21 AM, Fletcher Cocquyt wrote:
"snapmirrors would have been running during this time (3 minute intervals to a DR site)"
How many snapmirrors are you running at 3-minute intervals? Those can generate a lot of IO and CPs.
A while ago we discovered our snapmirrors were accounting for 50% of our IOPs (tuning the schedule fixed that):
http://www.vmadmin.info/2010/07/vmware-and-netapp-deconstructing.html
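In 7-Mode the schedule lives in /etc/snapmirror.conf on the destination filer. A sketch of staggering with hypothetical filer and volume names; the idea is to offset the minute lists so transfers don't all fire together:

```
# /etc/snapmirror.conf on the destination filer -- hypothetical names.
# Fields: source:vol  destination:vol  args  minute hour day-of-month day-of-week
filerA:vm_prod  drfiler:vm_prod  -  0,5,10,15,20,25,30,35,40,45,50,55 * * *
filerA:vm_dev   drfiler:vm_dev   -  2,12,22,32,42,52 * * *
filerA:vm_test  drfiler:vm_test  -  7,17,27,37,47,57 * * *
```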
We also went through VM alignment on NFS - we were able to quantify the alignment then trend it as we aligned VMs:
http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html
On Jul 26, 2013, at 9:43 PM, Chris Picton chris@picton.nom.za wrote:
On 2013/07/27 7:09 AM, Fletcher Cocquyt wrote:
"snapmirrors would have been running during this time (3 minute intervals to a DR site)"
how many snapmirrors are you running at 3 minute intervals ? Those can a lot of IO and CPs
A while ago we discovered our snapmirrors were accounting for 50% of our IOPs (tuning the schedule fixed that):
15 snapmirrors - most are actually on a 5-minute interval; the higher-priority ones are on 3 minutes. I will investigate whether they can/should be tuned.
I started the dedup on a different volume (250GB, 100GB used, similar usage pattern), and it is running fine - it has so far scanned 50GB in 1 hour.
A reallocate measure on the problematic volume reports a value of 6, so it should probably be reallocated. I will increase the size of the volume, run a reallocate, and try the dedup again later on.
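For reference, that sequence might look like the following on the 7-Mode CLI (volume name hypothetical; -o runs the measure once, -f forces a one-off full reallocation):

```
vol size testvol +50g                 # grow the volume first so reallocate has free blocks
reallocate measure -o /vol/testvol    # one-off measurement; values above ~4 suggest reallocating
reallocate start -f /vol/testvol      # forced full reallocation pass
sis start -s /vol/testvol             # then retry the full dedup scan
```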
Thanks Chris
Hi Chris,
I have seen this same behavior on a 3160 (8.1.2P3) trying to dedupe a single VMware datastore.
This datastore lived on an aggregate made up of 90 10k disks with 6 raid groups of 15 disks each. Even with reallocate keeping the data spread out, every time a dedupe job started, all 4 CPU cores would go to 100% until it finished (days later) or I stopped it. SNMP would stop responding, Performance Manager would have huge gaps in data collection for the controller, etc.
I figured it was too much work for the CPU to handle the "math" of deduping data across that many disks, or a bug. But, as you say, other dedupe jobs on the same controller, on smaller aggrs, work fine.
They had 30+ TB of empty space on the aggr, so I just disabled dedupe and let the VMware volumes grow.
Sorry I don't have a solution for you, but wanted to let you know you weren't alone.
---- Scott Eno s.eno@me.com
On Jul 27, 2013, at 12:04 AM, Chris Picton chris@picton.nom.za wrote:
A few things: if you have gone from 8.1.2P3 to P4, you are probably going to want to do a sis start -s on all volumes over 50% full or so (NetApp says 70%, I say 50%), and any volumes on aggregates in the same boat.
I do not have the bug number on hand, but there is a bug fixed in 8.1.2P4 relating to deduplication that causes your next dedup run to inflate the volumes by as much as 30%. I can't help wondering if this bug could be related to your issue. If you can't figure out which bug I am talking about, I can dig through my emails.
Also, just a shot in the dark: have you run a statit to see if a certain disk could be the source of a bottleneck?
________________________________________ From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of Scott Eno [s.eno@me.com] Sent: Saturday, July 27, 2013 9:43 AM To: Chris Picton Cc: toasters@teaparty.net Subject: Re: sis start -s causing system slowdown
I found the bug I was thinking of:
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=657692
It certainly caused some headaches for me after upgrading to 8.1.2P4.
---JMS ________________________________________ From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of Jordan Slingerland [Jordan.Slingerland@independenthealth.com] Sent: Saturday, July 27, 2013 12:00 PM To: Scott Eno; Chris Picton Cc: toasters@teaparty.net Subject: RE: sis start -s causing system slowdown
On 28/07/13 00:04, Jordan Slingerland wrote:
To veer further off-topic, that bug does explain the increase in disk usage I was seeing on some volumes running 8.1.2P3. I ran sis start -s on them and got back 15-30%. Annoyingly, this went straight into snapshot usage; it seems a bit silly that the dedupe metadata is snapshotted.
Interestingly there was only a small saving on the snapvault secondary volumes, which are also on a filer running 8.1.2P3.
Hi James,
We've had exactly the same issue. I believe NetApp moved the dedupe metadata from the aggregate into the volume a while back (maybe to ease migrating deduped volumes between aggrs?).
We also have a VMware NFS volume that currently has 1TB of data unaccounted for: the filer shows 1TB more space available than VMware can see. When I VSM'd this volume to a new volume, it showed the correct space usage on both VMware and NetApp. VSM appears to skip replicating the metadata, but as you mentioned, it does get included in snapshots. The difference with this volume is that sis start -s doesn't remove the metadata. It was suggested that I reset the SIS data, but I am reluctant to do that and will instead just create a new volume and datastore and Storage vMotion into it.
We're currently running 8.1.2P1 and waiting until 8.1.3P1 to resolve this and several other issues.
On 31/07/13 16:33, Martin wrote:
We're currently running 8.1.2P1 and waiting until 8.1.3P1 to resolve this and several other issues.
8.1.3P1 has been out for a few weeks, anyone tried it yet? I'm going to be upgrading to it on the weekend.
Hi,
I've got two systems upgraded just yesterday, no issues seen so far.
Cheers, Vladimir
On Wednesday, August 21, 2013, James Andrewartha wrote:
I also upgraded a pair of 3240s - no issues so far.
bye,
Alexander Griesser System-Administrator
ANEXIA Internetdienstleistungs GmbH
Telefon: +43-463-208501-320 Telefax: +43-463-208501-500
E-Mail: ag@anexia.at Web: http://www.anexia.at/
Anschrift Hauptsitz Klagenfurt: Feldkirchnerstraße 140, 9020 Klagenfurt Geschäftsführer: Alexander Windbichler Firmenbuch: FN 289918a | Gerichtsstand: Klagenfurt | UID-Nummer: ATU63216601
On 21.08.2013, at 22:50, "Momonth" <momonth@gmail.com> wrote:
Hi,
I've got two systems upgraded just yesterday, no issues seen so far.
Cheers, Vladimir
On Wednesday, August 21, 2013, James Andrewartha wrote: On 31/07/13 16:33, Martin wrote:
We're currently running 8.1.2P1 and waiting until 8.1.3P1 to resolve this and several other issues.
8.1.3P1 has been out for a few weeks, anyone tried it yet? I'm going to be upgrading to it on the weekend.
_______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
Looking at busy disks in performance manager (NMC), the busiest disks in the aggregate never passed 10% utilization, which is what I would expect from an aggregate of 90 disks.
Since Chris is at P4, perhaps he's seeing that bug.
---- Scott Eno s.eno@me.com
On Jul 27, 2013, at 12:00 PM, Jordan Slingerland Jordan.Slingerland@independenthealth.com wrote:
A few things: if you have gone from 8.1.2P3 to P4, you are probably going to want to do a sis start -s on all volumes over 50% or so (NetApp says 70, I say 50), and any volumes on aggregates in the same boat.
I do not have the bug number off hand, but there is a bug fixed in 8.1.2P4 involving deduplication that can cause your next dedup run to inflate the volumes by as much as 30%. I can't help wondering if this bug could be related to your issue. If you can't figure out which bug I am talking about, I can dig through my emails.
Also, just a shot in the dark: have you run a statit to see if a certain disk could be the source of a bottleneck?
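The statit suggestion can be checked mechanically once the output is captured (statit -b to begin, statit -e to end, in advanced privilege). Below is a minimal sketch that scans a captured disk section for the busiest spindle. The sample lines and the assumed column layout (disk path first, utilization percent second) are illustrative only and may not match every ONTAP release, so adjust the pattern to your actual statit -e output.

```python
import re

def busiest_disk(statit_text):
    """Return (disk, ut%) for the highest-utilization disk line.

    Assumes lines shaped like '/aggr0/plex0/rg0/0a.16  97  ...'
    with the disk path first and utilization percent second.
    """
    best = (None, -1)
    for line in statit_text.splitlines():
        m = re.match(r"\s*(/\S+)\s+(\d+)", line)
        if m:
            disk, ut = m.group(1), int(m.group(2))
            if ut > best[1]:
                best = (disk, ut)
    return best

# Hypothetical excerpt of a statit disk section:
sample = """
/aggr0/plex0/rg0/0a.16   9   ...
/aggr0/plex0/rg0/0a.17  97   ...
/aggr0/plex0/rg1/0b.32  12   ...
"""
print(busiest_disk(sample))  # → ('/aggr0/plex0/rg0/0a.17', 97)
```

A single hot disk standing well above its raid-group peers would point at a layout or reallocation problem rather than a CPU-bound dedup scan.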
From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of Scott Eno [s.eno@me.com] Sent: Saturday, July 27, 2013 9:43 AM To: Chris Picton Cc: toasters@teaparty.net Subject: Re: sis start -s causing system slowdown
Hi Chris,
I have seen this same behavior on a 3160 (8.1.2P3) trying to dedupe a single VMware datastore.
This datastore lived on an aggregate made up of 90 10k disks with 6 raid groups of 15 disks each. Even with reallocate keeping the data spread out, every time a dedupe job would start the 4 CPU cores would go to 100% until it finished (days later), or I stopped it. Snmp would stop responding, performance manager would have huge gaps in data collection for the controller, etc.
Figured it was too much work for the CPU to handle the "math" of deduping data across that many disks, or a bug. But, as you say, other dedupe jobs on the same controller, on smaller aggrs, work fine.
They had 30+ TB of empty space on the aggr, so I just disabled dedupe and let the VMware volumes grow.
Sorry I don't have a solution for you, but wanted to let you know you weren't alone.
Scott Eno s.eno@me.com
On Jul 27, 2013, at 12:04 AM, Chris Picton chris@picton.nom.za wrote:
Hi all
One of the volumes exported via NFS from my fas3210 didn't have dedup enabled when commissioned. It is 250GB, and hosts ploop-backed openvz vms. It is currently using about 210GB, and the hourly snapshot size is about 6GB.
When I run sis start -s on this volume, the entire system slows down to a crawl. My snmp monitoring starts timing out, ssh access to the system is hit and miss, taking over a minute to log in, and once logged on, command response is sluggish. I also get the following error in the logs for all snapmirror pairs:
SnapMirror: source transfer from TEST_TESTVOL to xx.yy.zz:TEST_TESTVOL : request denied, previous request still processing.
Fortunately, disk access from clients on this and other volumes are not detrimentally affected, but IO response times do go up by about 100ms.
After running overnight for 11 hours, sis status reports:
Progress: 19333120 KB Scanned
Change Log Usage: 88%
Logical Data: 151 GB/49 TB (0%)
At this rate, it will take about 5 days to finish scanning, leaving me barely able to manage the system effectively while this is happening.
Is this normal behaviour? Do I just have to wait it out, or can I stop it and correct something before trying again? Also, is the change log filling up towards 100% something to worry about?
Regards Chris
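As a side note, the 5-day figure checks out arithmetically from the sis status numbers above. A quick sketch (plain arithmetic, using only the figures quoted in the post):

```python
# Back-of-the-envelope ETA for the "sis start -s" scan,
# using the numbers reported in the post.

scanned_kb = 19_333_120      # "Progress: 19333120 KB Scanned"
elapsed_hours = 11           # ran overnight for 11 hours
volume_used_gb = 210         # volume is using about 210 GB

scanned_gb = scanned_kb / (1024 * 1024)        # ~18.4 GB scanned so far
rate_gb_per_hour = scanned_gb / elapsed_hours  # ~1.7 GB/h
eta_days = volume_used_gb / rate_gb_per_hour / 24

print(f"Scanned so far:  {scanned_gb:.1f} GB")
print(f"Scan rate:       {rate_gb_per_hour:.2f} GB/h")
print(f"Estimated total: {eta_days:.1f} days")
```

At roughly 1.7 GB/h, scanning the full 210 GB comes out to just over 5 days, matching the estimate in the post.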
On 2013/07/27 6:00 PM, Jordan Slingerland wrote:
A few things: if you have gone from 8.1.2P3 to P4, you are probably going to want to do a sis start -s on all volumes over 50% or so (NetApp says 70, I say 50), and any volumes on aggregates in the same boat.
I do not have the bug number off hand, but there is a bug fixed in 8.1.2P4 involving deduplication that can cause your next dedup run to inflate the volumes by as much as 30%. I can't help wondering if this bug could be related to your issue. If you can't figure out which bug I am talking about, I can dig through my emails.
Also, just a shot in the dark: have you run a statit to see if a certain disk could be the source of a bottleneck?
All volumes are currently on the same aggregate (48 disks). Only one of the volumes so far is exhibiting the strange behaviour. After reallocating and growing the volume to give it 40% free space, I still get the same behaviour when trying to initialise dedup on it for the first time.
I think I will migrate the vm images off this volume onto one which already has dedup enabled. That seems like the easiest plan right now.
Chris