So, it looks like the best way to manipulate scrubbing in ONTAP is the "storage raid-options" command. Here are some of the relevant options (from 9.6; YMMV with versions before ONTAP 9.6!)
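
To see what a node is currently set to, something like this should work (node name "cluster1-01" is just a placeholder, and the exact output columns may vary by ONTAP version):

```shell
# List the scrub-related RAID options on one node
storage raid-options show -node cluster1-01 -name raid.scrub*

# Media-scrub options are named separately
storage raid-options show -node cluster1-01 -name raid.media_scrub*
```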

raid.media_scrub.enable
This option enables/disables continuous background media scrubs for all the aggregates in the system. Valid values are on and off. The default value is on. When enabled, a low-overhead version of scrub that checks only for media errors runs continuously on all aggregates in the system. Background media scrub has a negligible performance impact on the user workload and this is achieved by aggressive disk and CPU throttling.
raid.media_scrub.rate
This option sets the rate of media scrub on an aggregate. Valid values for this option range from 300 to 3000, where a rate of 300 represents a media scrub of approximately 512 MB per hour and 3000 represents approximately 5 GB per hour. The default value for this option is 600, which is a rate of approximately 1 GB per hour.
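
As a sketch, bumping that rate on one node would look something like this (node name and value are placeholders; the throughput estimate just interpolates linearly between the documented 300 and 3000 endpoints):

```shell
# Raise the background media scrub from the default 600 (~1 GB/hr)
# to 1800 (~3 GB/hr, assuming the rate scales roughly linearly)
storage raid-options modify -node cluster1-01 -name raid.media_scrub.rate -value 1800
```
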
raid.scrub.duration
This option sets the duration of automatically started scrubs, in minutes. If this is not set or is set to 0, the default duration is 4 hours (240 minutes). If set to -1, all automatic scrubs run to completion.
raid.scrub.enable
This option enables/disables the RAID scrub feature. Valid values are on or off. The default value is on. This option only affects the scrubbing process that gets started from cron. This option is ignored for user-requested scrubs.
raid.scrub.perf_impact
This option sets the overall performance impact of RAID scrubbing (whether started automatically or manually). When the CPU and disk bandwidth are not consumed by serving clients, scrubbing consumes as much bandwidth as it needs. If serving clients is already consuming most or all of the CPU and disk bandwidth, this option controls how much CPU and disk bandwidth can be taken away for scrubbing, and thereby limits the negative performance impact on clients. As the value of this option is increased, the speed of scrubbing also increases. The possible values for this option are low, medium, and high. The default value is low.

When scrub and mirror verify are running at the same time, the system does not distinguish between their separate resource consumption on shared resources (like CPU or a shared disk). In this case, the combined resource utilization of these operations is limited to the maximum resource entitlement for individual operations.
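
If clients are getting squeezed (or the scrub is crawling), the impact level can be adjusted per node; a minimal sketch, with the node name as a placeholder:

```shell
# Let scrubs take more resources, e.g. on a lightly loaded DR node
storage raid-options modify -node cluster1-01 -name raid.scrub.perf_impact -value medium
```
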
raid.scrub.schedule
This option specifies the weekly schedule (day, time and duration) for scrubs started automatically by the raid.scrub.enable option. On a non-AFF system, the default schedule is daily at 1 a.m. for a duration of 4 hours, except on Sunday when it is 12 hours. On an AFF system, the default schedule is weekly at 1 a.m. on Sunday for a duration of 6 hours. If an empty string ("") is specified as an argument, it deletes the previous scrub schedule and restores the default schedule. One or more schedules can be specified using this option. The syntax is duration[h|m]@weekday@start_time[,duration[h|m]@weekday@start_time,...] where duration is the time period for which the scrub operation is allowed to run, in hours or minutes ('h' or 'm' respectively). If duration is not specified, the raid.scrub.duration option value is used as the duration for the schedule.

Weekday is the day on which the scrub is scheduled to start. The valid values are sun, mon, tue, wed, thu, fri, sat.

start_time is the time when the scrub is scheduled to start. It is specified in 24-hour format; only the hour (0-23) needs to be specified.

For example, options raid.scrub.schedule 240m@tue@2,8h@sat@22 will cause scrub to start on every Tuesday at 2 a.m. for 240 minutes, and on every Saturday at 10 p.m. for 480 minutes.
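
Setting that same example schedule from the cluster shell would look something like this (sketch; the node name is a placeholder and quoting may differ in your shell):

```shell
# Scrub for 4 hours every Tuesday at 2 a.m. and 8 hours every Saturday at 10 p.m.
storage raid-options modify -node cluster1-01 -name raid.scrub.schedule -value "240m@tue@2,8h@sat@22"

# Revert to the default schedule
storage raid-options modify -node cluster1-01 -name raid.scrub.schedule -value ""
```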

--tmac

Tim McCarthy, Principal Consultant

Proud Member of the #NetAppATeam

I Blog at TMACsRack




On Fri, May 31, 2019 at 4:36 PM Dalvenjah FoxFire <dalvenjah@dal.net> wrote:
So it’s possible that someone at NetApp has done some further analysis on this;
but my take is that this is the process that:
* validates that it can read data from a disk
* validates the RAID checksums
* validates the WAFL block checksums
* does other validity checks on the RG, aggr, filesystem, etc.

Because you don’t want problems to add up (especially, you don’t want problems
to add up to the point where you discover 3 read errors in the same stripe during
a rebuild!) you want to find and repair the issues relatively quickly (and also trigger
any disk health thresholds sooner rather than later).

So my take has always been to aim for doing a full scrub of all the media in a filer
within a month, and have it able to restart from the beginning and repeat the next
month, etc.
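
For that kind of policy you can also kick off (and check on) scrubs by hand; on ONTAP 9 something like the following should do it (aggregate name is a placeholder):

```shell
# Start a scrub on one aggregate and check its progress
storage aggregate scrub -aggregate aggr1 -action start
storage aggregate scrub -aggregate aggr1 -action status
```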

On large production filers, I’ve changed it from the default of a couple of hours once a
week to running for several early-AM hours every day; on DR filers that are mostly only
doing snapmirrors, I tend to be more aggressive - give it 8-12 hours each day and maybe
raise the scrub priority, as long as it doesn’t impact snapmirror update times (and it usually
doesn’t).

I’ve had another brand of disk array that had so much horsepower in the controllers they
would run continuous scans of every raid group, with enough smarts to immediately
give control over the disks to live I/O and pick up again when things went idle; that may
not be necessary but gave decent peace of mind.

I have seen these scrubs pick up and repair errors before, but I haven’t checked logs to
see how often it happens nowadays; with 10/12/14 TB drives I’d expect it to happen more
often, but I don’t know how true that is.

Someone let me know if any of my takes are incorrect, but I definitely don’t see a harm
in raising the schedule so that each bit gets scrubbed more often.

-dalvenjah

> On May 31, 2019, at 12:57 PM, Sylvain Robitaille <syl@encs.concordia.ca> wrote:
>
> On Thu, 30 May 2019, tmac wrote:
>
>> This is normal. I forget the options at this point (don't really use
>> them any more) but there is a default limit as to how long scrubs
>> will run. They remember where they left off and pick up one week
>> later.
>
> Right.  I understand all that.  I was really hoping more for pointers to
> documentation that would help me decide whether or not to make any
> adjustments, and what to adjust _to_.  The default, at least for the
> version of Ontap we're using is described in my original message (as
> well as, come to think of it, which options are relevant ...).
>
>> It also, if I recall correctly will only do so many scrubs at the same
>> time.
>
> I haven't found any documentation to that effect (though, of course it
> makes sense, and I do expect that's the case).  Do you have any you can
> point me to?
>
>> Remember, it is not just scrubbing the aggregate, but looking at each
>> Raid Group for consistency.
>
> Yes, I understand that.
>
> --
> ----------------------------------------------------------------------
> Sylvain Robitaille                               syl@encs.concordia.ca
>
> Systems analyst / AITS                            Concordia University
> Faculty of Engineering and Computer Science   Montreal, Quebec, Canada
> ----------------------------------------------------------------------
> _______________________________________________
> Toasters mailing list
> Toasters@teaparty.net
> http://www.teaparty.net/mailman/listinfo/toasters

