This is normal. I forget the exact options at this point (I don't really use them any more), but there is a default limit on how long a scrub will run. Scrubs remember where they left off and pick up again one week later. If I recall correctly, it will also only run so many scrubs at the same time. Remember, it's not just scrubbing the aggregate, but checking each RAID group for consistency.
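From memory (so double-check the option names against the man pages on your release), the knobs show up per node with "storage raid-options show", and you can drive a scrub by hand with the same command you used for status, something like:

    storage raid-options show -node fc1-n1
    storage aggregate scrub -aggregate e1n1_d00 -action resume

In 7-mode the run-time cap was raid.scrub.duration, in minutes, which would line up with the "240" in your log message; I'm not certain what 9.5 calls the equivalent, so treat that name as a guess and go by whatever your nodes actually list.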
--tmac
On Thu, May 30, 2019 at 1:19 PM Sylvain Robitaille syl@encs.concordia.ca wrote:
Hello NetApp Gurus. I'm hoping to get some guidance on RAID group scrubbing options and, with some luck, pointers to documentation that would help me determine appropriate values in our environment.
We've been seeing what feels like a large number of scrubbing "timeouts" on our filers. Log entries similar to this one:
May 28 05:00:02 fc1-ev-n1.console.private [kern.notice]
[fc1-ev-n1:raid.scrub.suspended.timer:notice]: Disk scrub suspended because the scrub time limit 240 was exceeded. It will resume at the next weekly/scheduled scrub.
What we'd like to know is how concerned we should be about this. Ideally we'd be seeing "scrubs" complete reasonably frequently, but I'm honestly not sure how I could determine how frequently that happens, or how frequently it *should* happen for that matter.
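(I suppose I could look for scrub-completion events in the event log, something along the lines of "event log show -message-name raid.rg.scrub.done", though I'm only guessing at that message name; if there's a more direct way to see when a scrub last completed, I'd love to know it.)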
The scrub options on the filer have not been changed from the Ontap defaults (NetApp Release 9.5P2, but we've been seeing this with earlier versions as well):
Node    Option                  Value  Constraint
------- ----------------------- ------ ----------
fc1-n1  raid.media_scrub.rate   600    only_one
fc1-n1  raid.scrub.perf_impact  low    only_one
fc1-n1  raid.scrub.schedule            none
(and the same for the partner node, of course)
The "storage raid-options" manual page indicates that the default schedule of daily at 1am for 4 hours, except Sundays when it runs for 12 hours, applies if no explicit schedule is defined.
If I examine the scrub status of our aggregates:
fc1-ev::> storage aggregate scrub -aggregate * -action status
Raid Group:/e1n2_tssd/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:39:10 2019
Raid Group:/e1n2_t01/plex0/rg0, Is Suspended:true, Last Scrub:Sun May 26 02:55:15 2019, Percentage Completed:65%
Raid Group:/e1n2_t02/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:28:37 2019
Raid Group:/e1n2_root/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 03:39:03 2019
Raid Group:/e1n1_root/plex0/rg0, Is Suspended:false, Last Scrub:Wed May 29 02:03:43 2019
Raid Group:/e1n1_d01/plex0/rg0, Is Suspended:true, Last Scrub:Tue May 28 03:22:52 2019, Percentage Completed:7%
Raid Group:/e1n1_d01/plex0/rg1, Is Suspended:true, Last Scrub:Wed May 29 02:00:56 2019, Percentage Completed:4%
Raid Group:/e1n1_d01/plex0/rg2, Is Suspended:false, Last Scrub:Wed May 29 04:02:50 2019
Raid Group:/e1n1_d01/plex0/rg3, Is Suspended:false, Last Scrub:Wed May 29 04:08:45 2019
Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
Raid Group:/e1n1_d00/plex0/rg1, Is Suspended:true, Last Scrub:Sun Apr 28 07:38:30 2019, Percentage Completed:80%
The truth is I'm not sure how to interpret this output:
Is it the case that each RAID group where "Is Suspended:false" *completed* its scrub at the "Last Scrub" time, while those that are suspended are those for which we're seeing log entries?
Given the default schedule that has the scrub run for 12 hours on Sunday mornings, does it seem odd that /e1n2_t01/plex0/rg0 was suspended last Sunday at 02:55:15, well before completion? In fact, all of the RAID groups interrupted on a Sunday were interrupted well before the 12 hours were up. Might there be other reasons for suspending scrub operations? The load on this filer is not excessive in any way; CPU utilization is typically comfortably below 50%.
How do I determine why the two RAID groups containing e1n1_d00 haven't run scrubbing in over a month? Is there something I should do about that?
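(I'm guessing I could simply kick off a scrub on that aggregate by hand, something along the lines of:

    storage aggregate scrub -aggregate e1n1_d00 -action resume

or "-action start" to begin a fresh pass, assuming those actions work the way I expect, but I'd still like to understand why the scheduled scrubs never get back to it.)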
I've found documentation that explains the options and how to change them, but none that explains how to decide whether I *should* change them, or how to determine what to change them to. My reading is that raid.media_scrub.rate and raid.scrub.schedule could be used together to tune the scrubbing, but I'm quite unsure how to determine the best values for our filers. Any pointers to documentation that would help here would be hugely appreciated.
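(For what it's worth, I gather the knobs would be adjusted with something like the following, though I haven't tried it and I'm not sure of the exact schedule syntax on 9.5, so please treat the value strings as guesses on my part:

    storage raid-options modify -node fc1-n1 -name raid.scrub.schedule -value 240m@tue@2,8h@sun@1
    storage raid-options modify -node fc1-n1 -name raid.scrub.perf_impact -value medium

The "duration@weekday@start_hour" format is what I remember from the 7-mode documentation; I'd welcome confirmation that it still applies here.)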
Thanks in advance ...
--
Sylvain Robitaille syl@encs.concordia.ca
Systems analyst / AITS
Concordia University
Faculty of Engineering and Computer Science
Montreal, Quebec, Canada
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters