This is normal. I forget the options at this point (I don't really use them any more), but there is a default limit on how long scrubs will run. They remember where they left off and pick up one week later. If I recall correctly, it will also only run so many scrubs at the same time. Remember, it is not just scrubbing the aggregate, but checking each RAID group for consistency.
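
If you want to check or raise that limit, I believe it is exposed as a raid option (the "240" in your log message looks like the default of 240 minutes), so something along these lines should show and adjust it. Syntax from memory, untested on 9.5:

    storage raid-options show -node * -name raid.scrub.duration
    storage raid-options modify -node fc1-n1 -name raid.scrub.duration -value 360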

--tmac

On Thu, May 30, 2019 at 1:19 PM Sylvain Robitaille <syl@encs.concordia.ca> wrote:

Hello NetApp Gurus.  I'm hoping mostly to get some guidance on RAID
group scrubbing options and with some luck, perhaps pointers to
documentation that would help me determine appropriate values in
our environment.

We've been seeing what feels like a large number of scrubbing
"timeouts" on our filers, with log entries similar to this one:

    May 28 05:00:02 fc1-ev-n1.console.private [kern.notice] [fc1-ev-n1:raid.scrub.suspended.timer:notice]: Disk scrub suspended because the scrub time limit 240 was exceeded. It will resume at the next weekly/scheduled scrub.

What we'd like to know is how concerned we should be about this.
Ideally we'd be seeing "scrubs" complete reasonably frequently, but I'm
honestly not sure how I could determine how frequently that happens,
or how frequently it *should* happen for that matter.

The scrub options on the filer have not been changed from the ONTAP
defaults (NetApp Release 9.5P2, but we've been seeing this with
earlier versions as well):

    Node    Option                  Value  Constraint
    ------- ----------------------- ------ ----------
    fc1-n1  raid.media_scrub.rate   600    only_one
    fc1-n1  raid.scrub.perf_impact  low    only_one
    fc1-n1  raid.scrub.schedule            none

(and the same for the partner node, of course)

The "storage raid-options" manual page indicates that the default
schedule of daily at 1am for 4 hours, except Sundays when it runs
for 12 hours, applies if no explicit schedule is defined.
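
If we did decide to set an explicit schedule, my reading of the manual
page is that the value is a comma-separated list of
duration@weekday@start_hour entries.  Purely as an illustrative
(untested) guess, something like this would scrub for 6 hours on
Saturdays starting at 22:00 and for 4 hours on Tuesdays at 02:00:

    fc1-ev::> storage raid-options modify -node fc1-n1 -name raid.scrub.schedule -value 360m@sat@22,240m@tue@2

but so far we've left the default in place.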

If I examine the scrub status of our aggregates:

    fc1-ev::> storage aggregate scrub -aggregate * -action status

    Raid Group:/e1n2_tssd/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:39:10 2019
    Raid Group:/e1n2_t01/plex0/rg0, Is Suspended:true, Last Scrub:Sun May 26 02:55:15 2019, Percentage Completed:65%
    Raid Group:/e1n2_t02/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:28:37 2019
    Raid Group:/e1n2_root/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 03:39:03 2019
    Raid Group:/e1n1_root/plex0/rg0, Is Suspended:false, Last Scrub:Wed May 29 02:03:43 2019
    Raid Group:/e1n1_d01/plex0/rg0, Is Suspended:true, Last Scrub:Tue May 28 03:22:52 2019, Percentage Completed:7%
    Raid Group:/e1n1_d01/plex0/rg1, Is Suspended:true, Last Scrub:Wed May 29 02:00:56 2019, Percentage Completed:4%
    Raid Group:/e1n1_d01/plex0/rg2, Is Suspended:false, Last Scrub:Wed May 29 04:02:50 2019
    Raid Group:/e1n1_d01/plex0/rg3, Is Suspended:false, Last Scrub:Wed May 29 04:08:45 2019
    Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
    Raid Group:/e1n1_d00/plex0/rg1, Is Suspended:true, Last Scrub:Sun Apr 28 07:38:30 2019, Percentage Completed:80%

The truth is I'm not sure how to interpret this output:

   - Is it the case that each RAID group where "Is Suspended:false"
     *completed* its scrub at the "Last Scrub" time, while those that
     are suspended are those for which we're seeing log entries?

   - Given the default schedule that has the scrub run for 12 hours
     on Sunday mornings, does it seem odd that /e1n2_t01/plex0/rg0 was
     suspended last Sunday at 02:55:15, prior to completion?  In fact,
     all those interrupted on a Sunday were interrupted well before
     12 hours.  Might there be other reasons for suspending scrub
     operations?  The load on this filer is not excessive in any way:
     CPU utilization is typically comfortably below 50%.

   - How do I determine why the two RAID groups containing e1n1_d00
     haven't run scrubbing in over a month?  Is there something I
     should do about that?  (Manually resuming them, perhaps, as in
     the sketch after this list?)
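
If the answer to that last question turns out to be "kick them along
manually", I assume something like the following would do it; my
understanding is that "-action resume" picks up a suspended scrub
where it left off, while "-action start" begins a fresh pass, though
I haven't tried either on these aggregates:

    fc1-ev::> storage aggregate scrub -aggregate e1n1_d00 -action resume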

I've found documentation that explains the options and how to change
them, but none that explains how to decide whether I *should* change
them, or what to change them to.  My interpretation is that
raid.media_scrub.rate and raid.scrub.schedule could be used together
to tune the scrubbing, but I'm quite unsure how to determine the best
values for our filers.  Any pointers to documentation that would help
here would be hugely appreciated.
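
One other place I've thought of looking, though I haven't tried it
yet, is the nodeshell; if the 7-Mode-style commands are still usable
on 9.5, I gather something like this might give a more verbose
per-RAID-group view:

    fc1-ev::> system node run -node fc1-n1 -command "aggr scrub status -v"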

Thanks in advance ...

--
----------------------------------------------------------------------
Sylvain Robitaille                               syl@encs.concordia.ca

Systems analyst / AITS                            Concordia University
Faculty of Engineering and Computer Science   Montreal, Quebec, Canada
----------------------------------------------------------------------
_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters