So it’s possible that someone at NetApp has done some further analysis on this, but my take is that the scrub is the process that:

* validates that it can read data from each disk
* validates the RAID checksums
* validates the WAFL block checksums
* does other validity checks on the RAID group, aggregate, filesystem, etc.
Because you don’t want problems to accumulate (especially not to the point where you discover three read errors in the same stripe during a rebuild!), you want to find and repair issues relatively quickly, and also trip any disk-health thresholds sooner rather than later.
So my approach has always been to aim for a full scrub of all the media in a filer within a month, letting it restart from the beginning and repeat the next month, and so on.
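As a back-of-envelope sanity check on that one-month target, here's a sketch of the arithmetic. The 50 MB/s effective scrub rate is purely an assumption for illustration (not a measured ONTAP number — check your own filer's scrub start/stop log timestamps), and the parallelism note reflects my understanding that a scrub walks stripes across the whole RAID group at once:

```python
# Rough scrub-budget estimate. The per-disk rate below is an assumed
# placeholder; measure your own filer's effective rate from its logs.

def hours_to_scrub(disk_tb: float, rate_mb_per_s: float) -> float:
    """Hours to read one disk end-to-end at the given rate."""
    return (disk_tb * 1e6) / rate_mb_per_s / 3600

# A 14 TB drive at an assumed 50 MB/s effective scrub rate:
per_disk = hours_to_scrub(14, 50)        # ~77.8 hours of scrub time

# A scrub reads stripes across the RAID group, touching its disks in
# parallel, so one full pass takes roughly the per-disk time, not the sum.
weekly_budget = 6                        # one ~6-hour weekly window (default-ish)
weeks_needed = per_disk / weekly_budget  # ~13 weeks: far more than a month

daily_budget = 4                         # 4 early-AM hours every day instead
days_needed = per_disk / daily_budget    # ~19 days: fits within a month

print(round(per_disk, 1), round(weeks_needed, 1), round(days_needed, 1))
```

With big drives, the default single weekly window simply can't finish a pass in a month; spreading the same work over daily windows can.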
On large production filers, I’ve changed it from the default of a couple of hours once a week to running for several early-AM hours every day. On DR filers that are mostly just receiving snapmirrors, I tend to be more aggressive: give it 8-12 hours each day and maybe raise the scrub priority, as long as it doesn’t impact snapmirror update times (and it usually doesn’t).
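For anyone who, like tmac, has forgotten the knobs: going from memory, the 7-mode options looked roughly like the below — please verify the exact syntax and option names against the command reference for your ONTAP release before using them:

```shell
# 7-mode style knobs (syntax from memory -- check your release's docs).
# Cap each scrub run at 6 hours (value is in minutes):
options raid.scrub.duration 360
# Run a 4-hour scrub every weekday at 01:00 instead of one weekly window
# (format: duration@weekday@start_hour, comma-separated):
options raid.scrub.schedule 4h@mon@1,4h@tue@1,4h@wed@1,4h@thu@1,4h@fri@1

# Clustered ONTAP moved these under storage raid-options, e.g.:
#   storage raid-options modify -node <node> -name raid.scrub.duration -value 360
# and manual control is via:
#   storage aggregate scrub -aggregate <aggr> -action start
```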
I’ve had another brand of disk array whose controllers had so much horsepower that they ran continuous scans of every RAID group, with enough smarts to immediately yield the disks to live I/O and pick up again when things went idle. That may not be necessary, but it gave decent peace of mind.
I have seen these scrubs pick up and repair errors before, but I haven’t checked logs to see how often it happens nowadays; with 10/12/14 TB drives I’d expect it to happen more often, but I don’t know how true that is.
Someone let me know if any of my takes are incorrect, but I definitely don’t see any harm in raising the schedule so that each bit gets scrubbed more often.
-dalvenjah
On May 31, 2019, at 12:57 PM, Sylvain Robitaille syl@encs.concordia.ca wrote:
On Thu, 30 May 2019, tmac wrote:
This is normal. I forget the options at this point (don't really use them any more), but there is a default limit on how long scrubs will run. They remember where they left off and pick up one week later.
Right. I understand all that. I was really hoping more for pointers to documentation that would help me decide whether or not to make any adjustments, and what to adjust _to_. The default, at least for the version of ONTAP we're using, is described in my original message (as well as, come to think of it, which options are relevant ...).
It also, if I recall correctly, will only run so many scrubs at the same time.
I haven't found any documentation to that effect (though, of course it makes sense, and I do expect that's the case). Do you have any you can point me to?
Remember, it is not just scrubbing the aggregate, but checking each RAID group for consistency.
Yes, I understand that.
--
Sylvain Robitaille syl@encs.concordia.ca
Systems analyst / AITS
Concordia University
Faculty of Engineering and Computer Science
Montreal, Quebec, Canada
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters