Is there a best practice for adjusting RAID "scrub" options?

30 May 2019


Hello NetApp Gurus.  I'm mainly hoping for some guidance on RAID
group scrubbing options and, with some luck, pointers to
documentation that would help me determine appropriate values for
our environment.
We've been seeing what feels like a large number of scrubbing
"timeouts" on our filers.  Log entries similar to this one:
May 28 05:00:02 fc1-ev-n1.console.private [kern.notice] [fc1-ev-n1:raid.scrub.suspended.timer:notice]: Disk scrub suspended because the scrub time limit 240 was exceeded. It will resume at the next weekly/scheduled scrub.
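To put a number on "a large number", a minimal sketch along these lines could count the suspension events per node from saved log lines. The regular expression is written only against the single sample entry above, and the function name `count_scrub_timeouts` is hypothetical; other message variants may not match.

```python
import re
from collections import Counter

# Pattern derived from the one sample log entry above (an assumption:
# other ONTAP releases may word the message differently).
EVENT_RE = re.compile(
    r"\[(?P<node>[^:\]]+):raid\.scrub\.suspended\.timer:notice\]: "
    r"Disk scrub suspended because the scrub time limit (?P<limit>\d+) was exceeded"
)

def count_scrub_timeouts(lines):
    """Count scrub-suspended events, keyed by (node, time limit)."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("node"), int(match.group("limit")))] += 1
    return counts

SAMPLE_LOG = [
    "May 28 05:00:02 fc1-ev-n1.console.private [kern.notice] "
    "[fc1-ev-n1:raid.scrub.suspended.timer:notice]: Disk scrub suspended "
    "because the scrub time limit 240 was exceeded. It will resume at the "
    "next weekly/scheduled scrub.",
]

for (node, limit), n in count_scrub_timeouts(SAMPLE_LOG).items():
    print(node, limit, n)
```

Feeding it a week or a month of log lines would show whether the timeouts happen on every scheduled run or only occasionally.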
What we'd like to know is how concerned we should be about this.
Ideally we'd be seeing "scrubs" complete reasonably frequently, but I'm
honestly not sure how I could determine how frequently that happens,
or how frequently it *should* happen for that matter.
The scrub options on the filer have not been changed from the ONTAP
defaults (NetApp Release 9.5P2, but we've been seeing this with
earlier versions as well):
    Node    Option                  Value  Constraint
    ------- ----------------------- ------ ----------
    fc1-n1  raid.media_scrub.rate   600    only_one
    fc1-n1  raid.scrub.perf_impact  low    only_one
    fc1-n1  raid.scrub.schedule            none
(and the same for the partner node, of course)
The "storage raid-options" manual page indicates that the default
schedule of daily at 1am for 4 hours, except Sundays when it runs
for 12 hours, applies if no explicit schedule is defined.
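As a rough worked example, the default schedule as described above gives the following weekly scrub window. The per-day hours come only from my reading of the manual page, and treating the "240" in the log entry as minutes is an assumption, although it would match the 4-hour weekday window.

```python
# Hours per day during which scrubbing may run under the default schedule,
# as described in the manual page (assumption: 4 hours daily, 12 on Sundays).
DEFAULT_SCRUB_HOURS = {
    "Mon": 4, "Tue": 4, "Wed": 4, "Thu": 4, "Fri": 4, "Sat": 4, "Sun": 12,
}

def weekly_scrub_hours(hours_by_day):
    """Total hours per week during which scrubbing is allowed to run."""
    return sum(hours_by_day.values())

def limit_minutes(hours):
    """Convert a daily window in hours to minutes."""
    return hours * 60

print(weekly_scrub_hours(DEFAULT_SCRUB_HOURS))    # 36
print(limit_minutes(DEFAULT_SCRUB_HOURS["Tue"]))  # 240
```

So any RAID group that needs more than roughly 36 hours of scrubbing per pass could not finish within one week on the defaults, if scrubs do resume where they left off.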
If I examine the scrub status of our aggregates:
    fc1-ev::> storage aggregate scrub -aggregate * -action status
    Raid Group:/e1n2_tssd/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:39:10 2019
    Raid Group:/e1n2_t01/plex0/rg0, Is Suspended:true, Last Scrub:Sun May 26 02:55:15 2019, Percentage Completed:65%
    Raid Group:/e1n2_t02/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:28:37 2019
    Raid Group:/e1n2_root/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 03:39:03 2019
    Raid Group:/e1n1_root/plex0/rg0, Is Suspended:false, Last Scrub:Wed May 29 02:03:43 2019
    Raid Group:/e1n1_d01/plex0/rg0, Is Suspended:true, Last Scrub:Tue May 28 03:22:52 2019, Percentage Completed:7%
    Raid Group:/e1n1_d01/plex0/rg1, Is Suspended:true, Last Scrub:Wed May 29 02:00:56 2019, Percentage Completed:4%
    Raid Group:/e1n1_d01/plex0/rg2, Is Suspended:false, Last Scrub:Wed May 29 04:02:50 2019
    Raid Group:/e1n1_d01/plex0/rg3, Is Suspended:false, Last Scrub:Wed May 29 04:08:45 2019
    Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
    Raid Group:/e1n1_d00/plex0/rg1, Is Suspended:true, Last Scrub:Sun Apr 28 07:38:30 2019, Percentage Completed:80%
The truth is I'm not sure how to interpret this output:
- Is it the case that each RAID group where "Is Suspended:false"
     *completed* its scrub at the "Last Scrub" time, while those that
     are suspended are those for which we're seeing log entries?
- Given the default schedule that has the scrub run for 12 hours
     on Sunday mornings, does it seem odd that /e1n2_t01/plex0/rg0 was
     suspended last Sunday at 02:55:15, prior to completion?  In fact,
     all those interrupted on a Sunday were interrupted well before
     12 hours.  Might there be other reasons for suspending scrub
     operations?  The load on this filer is not excessive in any way:
     CPU utilization is typically comfortably below 50%.
- How do I determine why the two RAID groups in the e1n1_d00
     aggregate haven't been scrubbed in over a month?  Is there
     something I should do about that?
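To keep an eye on this over time, a minimal sketch like the following could parse the status output shown earlier and flag RAID groups whose "Last Scrub" timestamp is older than some threshold. The line format is taken only from the output above, the names `parse_scrub_status` and `stale_groups` are hypothetical, and the 30-day threshold is an arbitrary illustration, not a recommended value. It also assumes English day and month names in the timestamps.

```python
import re
from datetime import datetime

# Matches one line of the status output shown earlier; the
# "Percentage Completed" field only appears for suspended groups.
LINE_RE = re.compile(
    r"Raid Group:(?P<rg>\S+), Is Suspended:(?P<susp>true|false), "
    r"Last Scrub:(?P<last>[^,]+?)"
    r"(?:, Percentage Completed:(?P<pct>\d+)%)?\s*$"
)

def parse_scrub_status(text):
    """Return one dict per RAID group line found in the status output."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        records.append({
            "raid_group": m.group("rg"),
            "suspended": m.group("susp") == "true",
            "last_scrub": datetime.strptime(m.group("last"),
                                            "%a %b %d %H:%M:%S %Y"),
            "percent": int(m.group("pct")) if m.group("pct") else None,
        })
    return records

def stale_groups(records, now, max_age_days=30):
    """RAID groups whose last scrub timestamp is at least max_age_days old."""
    return [r["raid_group"] for r in records
            if (now - r["last_scrub"]).days >= max_age_days]

SAMPLE = """\
Raid Group:/e1n2_tssd/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:39:10 2019
    Raid Group:/e1n1_d01/plex0/rg0, Is Suspended:true, Last Scrub:Tue May 28 03:22:52 2019, Percentage Completed:7%
    Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
"""

records = parse_scrub_status(SAMPLE)
print(stale_groups(records, now=datetime(2019, 5, 30, 12, 0, 0)))
```

This only reports what the status output says. It does not settle whether "Last Scrub" on a non-suspended group means a completed scrub, which is the first question above.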
I've found documentation that explains the options and how to change
them, but none that explains how to decide whether I *should* change
them, or how to determine what to change them to.  My understanding is
that raid.media_scrub.rate and raid.scrub.schedule could be used
together to tune the scrubbing, but I'm quite unsure how to determine
what the best values would be for our filers.  Any pointers to
documentation that would help here would be hugely appreciated.
Thanks in advance ...
-- 
----------------------------------------------------------------------
Sylvain Robitaille                               syl@encs.concordia.ca

Systems analyst / AITS                            Concordia University
Faculty of Engineering and Computer Science   Montreal, Quebec, Canada
----------------------------------------------------------------------
