So it’s possible that someone at NetApp has done some further analysis on this, but my take is that the scrub is the process that:

* validates that it can read data from each disk
* validates the RAID checksums
* validates the WAFL block checksums
* does other validity checks on the RAID group, aggregate, filesystem, etc.
Because you don’t want problems to accumulate (especially not to the point where you discover three read errors in the same stripe during a rebuild!), you want to find and repair issues relatively quickly, and also trip any disk-health thresholds sooner rather than later.
So my approach has always been to aim for a full scrub of all the media in a filer within a month, letting it restart from the beginning and repeat the next month, and so on.
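As a back-of-envelope sanity check on that one-month target, here's a sketch of the arithmetic. The 50 MB/s effective scrub rate is purely an assumption for illustration (not a measured ONTAP number — check your own filer's scrub start/stop log timestamps), and the parallelism note reflects my understanding that a scrub walks stripes across the whole RAID group at once:

```python
# Rough scrub-budget estimate. The per-disk rate below is an assumed
# placeholder; measure your own filer's effective rate from its logs.

def hours_to_scrub(disk_tb: float, rate_mb_per_s: float) -> float:
    """Hours to read one disk end-to-end at the given rate."""
    return (disk_tb * 1e6) / rate_mb_per_s / 3600

# A 14 TB drive at an assumed 50 MB/s effective scrub rate:
per_disk = hours_to_scrub(14, 50)        # ~77.8 hours of scrub time

# A scrub reads stripes across the RAID group, touching its disks in
# parallel, so one full pass takes roughly the per-disk time, not the sum.
weekly_budget = 6                        # one ~6-hour weekly window (default-ish)
weeks_needed = per_disk / weekly_budget  # ~13 weeks: far more than a month

daily_budget = 4                         # 4 early-AM hours every day instead
days_needed = per_disk / daily_budget    # ~19 days: fits within a month

print(round(per_disk, 1), round(weeks_needed, 1), round(days_needed, 1))
```

With big drives, the default single weekly window simply can't finish a pass in a month; spreading the same work over daily windows can.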
On large production filers, I’ve changed it from the default of a couple of hours once a week to running for several early-AM hours every day. On DR filers that are mostly just receiving snapmirrors, I tend to be more aggressive: give it 8-12 hours each day and maybe raise the scrub priority, as long as it doesn’t impact snapmirror update times (and it usually doesn’t).
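For anyone who, like tmac, has forgotten the knobs: going from memory, the 7-mode options looked roughly like the below — please verify the exact syntax and option names against the command reference for your ONTAP release before using them:

```shell
# 7-mode style knobs (syntax from memory -- check your release's docs).
# Cap each scrub run at 6 hours (value is in minutes):
options raid.scrub.duration 360
# Run a 4-hour scrub every weekday at 01:00 instead of one weekly window
# (format: duration@weekday@start_hour, comma-separated):
options raid.scrub.schedule 4h@mon@1,4h@tue@1,4h@wed@1,4h@thu@1,4h@fri@1

# Clustered ONTAP moved these under storage raid-options, e.g.:
#   storage raid-options modify -node <node> -name raid.scrub.duration -value 360
# and manual control is via:
#   storage aggregate scrub -aggregate <aggr> -action start
```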
I’ve had another brand of disk array whose controllers had so much horsepower that they ran continuous scans of every RAID group, with enough smarts to immediately yield the disks to live I/O and pick up again when things went idle. That may not be necessary, but it gave decent peace of mind.
I have seen these scrubs pick up and repair errors before, but I haven’t checked logs to see how often it happens nowadays; with 10/12/14 TB drives I’d expect it to happen more often, but I don’t know how true that is.
Someone let me know if any of my takes are incorrect, but I definitely don’t see any harm in raising the schedule so that each bit gets scrubbed more often.
-dalvenjah
On May 31, 2019, at 12:57 PM, Sylvain Robitaille syl@encs.concordia.ca wrote:
On Thu, 30 May 2019, tmac wrote:
This is normal. I forget the options at this point (don't really use them any more), but there is a default limit on how long scrubs will run. They remember where they left off and pick up one week later.
Right. I understand all that. I was really hoping more for pointers to documentation that would help me decide whether or not to make any adjustments, and what to adjust _to_. The default, at least for the version of ONTAP we're using, is described in my original message (as well as, come to think of it, which options are relevant ...).
It also, if I recall correctly, will only run so many scrubs at the same time.
I haven't found any documentation to that effect (though, of course it makes sense, and I do expect that's the case). Do you have any you can point me to?
Remember, it is not just scrubbing the aggregate, but checking each RAID group for consistency.
Yes, I understand that.
--
Sylvain Robitaille syl@encs.concordia.ca
Systems analyst / AITS
Concordia University
Faculty of Engineering and Computer Science
Montreal, Quebec, Canada
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters