Determining what's contributing to fast aggregate growth


Fletcher Cocquyt

2 Apr 2014

4:24 p.m.

Hi all,

In the last 36 hours or so, our 19 TB aggregate has grown above 18 TB used. Usually the aggregate's used level only grows when we grow its volumes. This is different: I was forced to delete snapshots and shrink volumes to get it back under 90%, and in the last 3 hours it's back above 91%. The used level is climbing 5-10 GB/minute.
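For a sense of how urgent this is, the figures above can be turned into a rough time-to-full estimate. The sketch below is only back-of-the-envelope arithmetic: the function name is made up for illustration, and it assumes 1 TB = 1024 GB and a constant growth rate. With a 19 TB aggregate at 91% used, 5-10 GB/minute leaves roughly three to six hours.

```python
def minutes_until_full(size_gb: float, used_fraction: float,
                       growth_gb_per_min: float) -> float:
    """Estimate minutes until an aggregate fills at a constant growth rate.

    Hypothetical helper for illustration only; real growth is rarely constant.
    """
    if growth_gb_per_min <= 0:
        raise ValueError("growth rate must be positive")
    free_gb = size_gb * (1.0 - used_fraction)
    return free_gb / growth_gb_per_min


if __name__ == "__main__":
    size_gb = 19 * 1024  # 19 TB aggregate, assuming 1 TB = 1024 GB
    for rate in (5.0, 10.0):  # GB per minute, the range reported in the post
        hours = minutes_until_full(size_gb, 0.91, rate) / 60.0
        print(f"at {rate:.0f} GB/min: about {hours:.1f} hours until full")
```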

So far I cannot see where the growth is coming from. Aggregate snapshots are off.

Ontap 8.1.2

na02> aggr status
  Aggr   State    Status           Options
  aggr0  online   raid_dp, aggr    root, nosnap=on, raidsize=19
                  64-bit

thanks for any tips!


Alexander Griesser

2 Apr

4:31 p.m.

New subject: Re: Determining what's contributing to fast aggregate growth

Hi Fletcher,

can you run `aggr show_space -h`, wait 10 minutes, and run it again? If the growth is still happening, the difference between the two outputs should show right away which volume is causing it.
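The comparison of the two samples can also be scripted. This is a minimal sketch, not a parser for the real command output: it assumes the per-volume used figures from each `aggr show_space` run have already been collected into dictionaries, and the function name, volume names, and numbers are all hypothetical.

```python
from typing import Dict, List, Tuple


def growth_by_volume(before: Dict[str, float],
                     after: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (volume, delta) pairs, largest growth first.

    `before` and `after` map volume name to used space (same unit in both).
    """
    deltas = []
    for name, used_after in after.items():
        # A volume that only appears in the second sample counts in full.
        used_before = before.get(name, 0.0)
        deltas.append((name, used_after - used_before))
    return sorted(deltas, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical samples in GB, taken 10 minutes apart.
    first = {"vol_a": 8200.0, "vol_b": 3100.0, "vol_c": 450.0}
    second = {"vol_a": 8275.0, "vol_b": 3100.5, "vol_c": 450.0}
    for name, delta in growth_by_volume(first, second):
        print(f"{name}: {delta:+.1f} GB")
```

The volume at the top of the list is the first place to look.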

Best,

Alexander Griesser System-Administrator



Tim Stiller

4:32 p.m.

Hi Fletcher,

SIS running?

BUG 657692: Stale metadata not automatically removed during deduplication operations on volume
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=657692

https://forums.netapp.com/thread/42487

regards, Tim


Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters

Fletcher Cocquyt

4:36 p.m.

Yes, sis is active. This may be it.

na02> sis status -l

Path: /vol/vm65net
State: Enabled
Compression: Disabled
Inline Compression: Disabled
Status: Active
Progress: 0 KB (0%) Done
Type: Regular
Schedule: tue-thu@23
Minimum Blocks Shared: 1
Blocks Skipped Sharing: 0
Last Operation State: Success
Last Successful Operation Begin: Thu Mar 27 23:00:00 PDT 2014
Last Successful Operation End: Fri Mar 28 07:01:44 PDT 2014
Last Operation Begin: Thu Mar 27 23:00:00 PDT 2014
Last Operation End: Fri Mar 28 07:01:44 PDT 2014
Last Operation Size: 354 GB
Last Operation Error: -
Change Log Usage: 2%
Logical Data: 8220 GB/69 TB (11%)
Queued Job: -
Stale Fingerprints: 1%

na02> sis stop /vol/vm65net
The operation on "/vol/vm65net" is being stopped.
irt-na02> Wed Apr 2 09:35:45 PDT [irt-na02:sis.op.stopped:error]: SIS operation for /vol/vm65net has stopped

Stopped. I will see whether this stops the growth.

thanks!


Fletcher Cocquyt

6 p.m.

The aggregate growth has stopped since I stopped sis on the vm65net volume. Thanks Tim, and all who replied. Otherwise this bug was going to fill up our aggregate, which has hundreds of VMs running on it.

Now I get to read up more on it and try to reclaim the space. Strange that this would happen after running de-dup for so long without issue.

thanks again, Fletcher


Jordan Slingerland

6:11 p.m.

I think if you dedup from the beginning, it should ditch all the old metadata and rebuild the fingerprint database. Hopefully you won't run into the bug again.

sis start -s /vol/vm_volume


Fletcher Cocquyt

10:47 p.m.

Can confirm

na02> sis start -s /vol/vm65net
The file system will be scanned to process existing data in /vol/vm65net.
This operation may initialize related existing metafiles.
Are you sure you want to proceed (y/n)? y

This has brought the aggregate from 90% (18/19 TB) to 82% (16/19 TB) in the last 90 minutes, and space is still being freed.

Pretty ironic that a feature (de-dup) designed to save space almost ate all of it! We have been on 8.1.2 for over a year, and had the same sis config running fine up until this week.

Thanks again


Fletcher Cocquyt

16 Apr

6:40 p.m.

New subject: FIXED: Determining what's contributing to fast aggregate growth - dedup metadata

Just wanted to follow up: if you are running <=8.1.2 with dedup, your aggregates may be growing due to metadata from sis.

Run sis status -l /vol/<volname> and check the status:

Last Operation State: Failure
Last Successful Operation Begin: Sun Jun 16 23:00:00 PDT 2013
Last Successful Operation End: Sun Jun 23 01:35:09 PDT 2013
Last Operation Begin: Thu Apr 3 10:52:07 PDT 2014
Last Operation End: Thu Apr 3 11:13:54 PDT 2014

Then run sis start -s /vol/<volname> to reclaim the space.

You get the aggregate space back, and as a side effect you might notice that the dedup jobs that had been failing are now returning 20% savings again.

Better monitoring/alerting about failed dedup jobs would also prevent this
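One way such a check might look is sketched below. It is an illustration under assumptions rather than a tested monitoring tool: it assumes the `sis status -l` text has already been captured (how it is fetched from the filer is left out), that each field sits on its own "Key: value" line, and that the field names match those shown in this thread. The function names are invented for the example.

```python
from typing import Dict, List


def parse_sis_status(text: str) -> Dict[str, str]:
    """Parse "Key: value" lines from captured `sis status -l` text."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # Split at the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def dedup_alerts(fields: Dict[str, str]) -> List[str]:
    """Return human-readable warnings for one volume's parsed status."""
    alerts = []
    state = fields.get("Last Operation State", "")
    if state and state != "Success":
        alerts.append(f"last dedup operation state is {state}")
    last_ok = fields.get("Last Successful Operation End")
    last_run = fields.get("Last Operation End")
    if last_ok and last_run and last_ok != last_run:
        alerts.append(f"last successful run ended {last_ok}, "
                      f"but the latest run ended {last_run}")
    return alerts


if __name__ == "__main__":
    # Field values taken from the failing status quoted above.
    sample = """Last Operation State: Failure
Last Successful Operation End: Sun Jun 23 01:35:09 PDT 2013
Last Operation End: Thu Apr 3 11:13:54 PDT 2014"""
    for alert in dedup_alerts(parse_sis_status(sample)):
        print("WARNING:", alert)
```

Run on a schedule per volume, a check like this would have flagged the failing jobs months before the aggregate started filling.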

thanks


Jordan Slingerland

2 Apr

4:35 p.m.

First, let's get a little information.

Could you give us the output of the following commands?

(I am assuming your problem aggregate is aggr0.)

aggr status -v aggr0
df -hA aggr0
snap list -A aggr0
aggr show_space aggr0

Next, can we narrow down the growth to a single volume or group of volumes?

How about providing a df -h for each volume, and also a snap list <volume> for any volumes using excessive snapshot space?

--JMS


Jordan Slingerland

4:42 p.m.

It looks like Tim and Alexander also gave you a few suggestions, both of which are good places to look.

What protocol does the filer serve? We may be able to get an idea of where the writes are coming from with per-client stats.

For NFS:

Enable per-client statistics:
options nfs.per_client_stats.enable on

First zero the counters:
ssh ntap1 vfiler run vfiler0 nfsstat -z

Then list the per-client stats, and repeat a few minutes later to see which client is sending all the NFS writes:
ssh ntap1 vfiler run vfiler0 nfsstat -l

For CIFS:

Enable per-client statistics:
options cifs.per_client_stats.enable on

Then list the most active clients:
cifs top -n 20

--JMS

