This is normal. I forget the options at this point (I don't really use them any more), but there is a default limit on how long scrubs will run. They remember where they left off and pick up one week later. If I recall correctly, it will also only run so many scrubs at the same time. Remember, it is not just scrubbing the aggregate, but checking each RAID group for consistency.
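
If you want to check or raise that limit, I believe it is exposed as a raid option (the "240" in your log message looks like the default of 240 minutes), so something along these lines should show and adjust it. Syntax from memory, untested on 9.5:

    storage raid-options show -node * -name raid.scrub.duration
    storage raid-options modify -node fc1-n1 -name raid.scrub.duration -value 360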

--tmac

On Thu, May 30, 2019 at 1:19 PM Sylvain Robitaille <syl@encs.concordia.ca> wrote:

Hello NetApp Gurus.  I'm hoping mostly to get some guidance on RAID
group scrubbing options and with some luck, perhaps pointers to
documentation that would help me determine appropriate values in
our environment.

We've been seeing what feels like a large number of scrubbing
"timeouts" on our filers, with log entries similar to this one:

    May 28 05:00:02 fc1-ev-n1.console.private [kern.notice] [fc1-ev-n1:raid.scrub.suspended.timer:notice]: Disk scrub suspended because the scrub time limit 240 was exceeded. It will resume at the next weekly/scheduled scrub.

What we'd like to know is how concerned we should be about this.
Ideally we'd be seeing "scrubs" complete reasonably frequently, but I'm
honestly not sure how I could determine how frequently that happens,
or how frequently it *should* happen for that matter.

The scrub options on the filer have not been changed from the ONTAP
defaults (NetApp Release 9.5P2, but we've been seeing this with
earlier versions as well):

    Node    Option                  Value  Constraint
    ------- ----------------------- ------ ----------
    fc1-n1  raid.media_scrub.rate   600    only_one
    fc1-n1  raid.scrub.perf_impact  low    only_one
    fc1-n1  raid.scrub.schedule            none

(and the same for the partner node, of course)

The "storage raid-options" manual page indicates that the default
schedule of daily at 1am for 4 hours, except Sundays when it runs
for 12 hours, applies if no explicit schedule is defined.
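
If we did decide to set an explicit schedule, my reading of the manual
page is that the value is a comma-separated list of
duration@weekday@start_hour entries.  Purely as an illustrative
(untested) guess, something like this would scrub for 6 hours on
Saturdays starting at 22:00 and for 4 hours on Tuesdays at 02:00:

    fc1-ev::> storage raid-options modify -node fc1-n1 -name raid.scrub.schedule -value 360m@sat@22,240m@tue@2

but so far we've left the default in place.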

If I examine the scrub status of our aggregates:

    fc1-ev::> storage aggregate scrub -aggregate * -action status

    Raid Group:/e1n2_tssd/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:39:10 2019
    Raid Group:/e1n2_t01/plex0/rg0, Is Suspended:true, Last Scrub:Sun May 26 02:55:15 2019, Percentage Completed:65%
    Raid Group:/e1n2_t02/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:28:37 2019
    Raid Group:/e1n2_root/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 03:39:03 2019
    Raid Group:/e1n1_root/plex0/rg0, Is Suspended:false, Last Scrub:Wed May 29 02:03:43 2019
    Raid Group:/e1n1_d01/plex0/rg0, Is Suspended:true, Last Scrub:Tue May 28 03:22:52 2019, Percentage Completed:7%
    Raid Group:/e1n1_d01/plex0/rg1, Is Suspended:true, Last Scrub:Wed May 29 02:00:56 2019, Percentage Completed:4%
    Raid Group:/e1n1_d01/plex0/rg2, Is Suspended:false, Last Scrub:Wed May 29 04:02:50 2019
    Raid Group:/e1n1_d01/plex0/rg3, Is Suspended:false, Last Scrub:Wed May 29 04:08:45 2019
    Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
    Raid Group:/e1n1_d00/plex0/rg1, Is Suspended:true, Last Scrub:Sun Apr 28 07:38:30 2019, Percentage Completed:80%

The truth is I'm not sure how to interpret this output:

   - Is it the case that each RAID group where "Is Suspended:false"
     *completed* its scrub at the "Last Scrub" time, while those that
     are suspended are those for which we're seeing log entries?

   - Given the default schedule that has the scrub run for 12 hours
     on Sunday mornings, does it seem odd that /e1n2_t01/plex0/rg0 was
     suspended last Sunday at 02:55:15, prior to completion?  In fact,
     all those interrupted on a Sunday were interrupted well before
     12 hours.  Might there be other reasons for suspending scrub
     operations?  The load on this filer is not excessive in any way:
     CPU utilization is typically comfortably below 50%.

   - How do I determine why the two RAID groups containing e1n1_d00
     haven't run scrubbing in over a month?  Is there something I
     should do about that?  (Manually resuming them, perhaps, as in
     the sketch after this list?)
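
If the answer to that last question turns out to be "kick them along
manually", I assume something like the following would do it; my
understanding is that "-action resume" picks up a suspended scrub
where it left off, while "-action start" begins a fresh pass, though
I haven't tried either on these aggregates:

    fc1-ev::> storage aggregate scrub -aggregate e1n1_d00 -action resume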

I've found documentation that explains the options and how to change
them, but none that explains how to decide whether I *should* change
them, or what to change them to.  My interpretation is that
raid.media_scrub.rate and raid.scrub.schedule could be used together
to tune the scrubbing, but I'm quite unsure how to determine the best
values for our filers.  Any pointers to documentation that would help
here would be hugely appreciated.
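
One other place I've thought of looking, though I haven't tried it
yet, is the nodeshell; if the 7-Mode-style commands are still usable
on 9.5, I gather something like this might give a more verbose
per-RAID-group view:

    fc1-ev::> system node run -node fc1-n1 -command "aggr scrub status -v"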

Thanks in advance ...

--
----------------------------------------------------------------------
Sylvain Robitaille                               syl@encs.concordia.ca

Systems analyst / AITS                            Concordia University
Faculty of Engineering and Computer Science   Montreal, Quebec, Canada
----------------------------------------------------------------------
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters