Hello NetApp Gurus. I'm hoping mostly to get some guidance on RAID group scrubbing options and with some luck, perhaps pointers to documentation that would help me determine appropriate values in our environment.
We've been seeing what feels like a large number of scrubbing "timeouts" on our filers. Log entries similar to this one:
May 28 05:00:02 fc1-ev-n1.console.private [kern.notice] [fc1-ev-n1:raid.scrub.suspended.timer:notice]: Disk scrub suspended because the scrub time limit 240 was exceeded. It will resume at the next weekly/scheduled scrub.
What we'd like to know is how concerned we should be about this. Ideally we'd be seeing "scrubs" complete reasonably frequently, but I'm honestly not sure how I could determine how frequently that happens, or how frequently it *should* happen for that matter.
The scrub options on the filer have not been changed from the ONTAP defaults (NetApp Release 9.5P2, but we've been seeing this with earlier versions as well):
Node    Option                  Value  Constraint
------- ----------------------- ------ ----------
fc1-n1  raid.media_scrub.rate   600    only_one
fc1-n1  raid.scrub.perf_impact  low    only_one
fc1-n1  raid.scrub.schedule     none
(and the same for the partner node, of course)
The "storage raid-options" manual page indicates that the default schedule of daily at 1am for 4 hours, except Sundays when it runs for 12 hours, applies if no explicit schedule is defined.
If I examine the scrub status of our aggregates:
fc1-ev::> storage aggregate scrub -aggregate * -action status
Raid Group:/e1n2_tssd/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:39:10 2019
Raid Group:/e1n2_t01/plex0/rg0, Is Suspended:true, Last Scrub:Sun May 26 02:55:15 2019, Percentage Completed:65%
Raid Group:/e1n2_t02/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:28:37 2019
Raid Group:/e1n2_root/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 03:39:03 2019
Raid Group:/e1n1_root/plex0/rg0, Is Suspended:false, Last Scrub:Wed May 29 02:03:43 2019
Raid Group:/e1n1_d01/plex0/rg0, Is Suspended:true, Last Scrub:Tue May 28 03:22:52 2019, Percentage Completed:7%
Raid Group:/e1n1_d01/plex0/rg1, Is Suspended:true, Last Scrub:Wed May 29 02:00:56 2019, Percentage Completed:4%
Raid Group:/e1n1_d01/plex0/rg2, Is Suspended:false, Last Scrub:Wed May 29 04:02:50 2019
Raid Group:/e1n1_d01/plex0/rg3, Is Suspended:false, Last Scrub:Wed May 29 04:08:45 2019
Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
Raid Group:/e1n1_d00/plex0/rg1, Is Suspended:true, Last Scrub:Sun Apr 28 07:38:30 2019, Percentage Completed:80%
The truth is I'm not sure how to interpret this output:
- Is it the case that each RAID group where "Is Suspended:false" *completed* its scrub at the "Last Scrub" time, while those that are suspended are those for which we're seeing log entries?
- Given the default schedule that has the scrub run for 12 hours on Sunday mornings, does it seem odd that /e1n2_t01/plex0/rg0 was suspended last Sunday at 02:55:15, well before completion? In fact, all the RAID groups interrupted on a Sunday were interrupted well before the 12-hour mark. Might there be other reasons for suspending scrub operations? The load on this filer is not excessive in any way; CPU utilization is typically comfortably below 50%.
- How do I determine why the two RAID groups in aggregate e1n1_d00 haven't been scrubbed in over a month? Is there something I should do about that?
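(For what it's worth, judging by the -action argument in the status command above, I assume I could kick those two off by hand with something like:

fc1-ev::> storage aggregate scrub -aggregate e1n1_d00 -action resume

but I haven't tried it, and I'd still like to understand why the scheduled runs never reach them.)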
I've found documentation that explains the options and how to change them, but none that explains how to decide whether I *should* change them, or what to change them to. My reading is that raid.media_scrub.rate and raid.scrub.schedule could be used together to tune the scrubbing, but I'm quite unsure how to determine the best values for our filers. Any pointers to documentation that would help here would be hugely appreciated.
Thanks in advance ...
This is normal. I forget the options at this point (I don't really use them any more), but there is a default limit on how long scrubs will run. They remember where they left off and pick up one week later. If I recall correctly, it will also only run so many scrubs at the same time. Remember, it is not just scrubbing the aggregate, but checking each RAID group for consistency.
--tmac
On Thu, 30 May 2019, tmac wrote:
This is normal. I forget the options at this point (I don't really use them any more), but there is a default limit on how long scrubs will run. They remember where they left off and pick up one week later.
Right. I understand all that. I was really hoping more for pointers to documentation that would help me decide whether or not to make any adjustments, and what to adjust _to_. The default, at least for the version of ONTAP we're using, is described in my original message (as well as, come to think of it, which options are relevant ...).
If I recall correctly, it will also only run so many scrubs at the same time.
I haven't found any documentation to that effect (though, of course it makes sense, and I do expect that's the case). Do you have any you can point me to?
Remember, it is not just scrubbing the aggregate, but checking each RAID group for consistency.
Yes, I understand that.
So it’s possible that someone at NetApp has done some further analysis on this, but my take is that this is the process that:
- validates that it can read data from a disk
- validates the RAID checksums
- validates the WAFL block checksums
- does other validity checks on the RG, aggr, filesystem, etc.
Because you don’t want problems to accumulate (especially not to the point where you discover 3 read errors in the same stripe during a rebuild!), you want to find and repair issues relatively quickly (and also trip any disk health thresholds sooner rather than later).
So my take has always been to aim for doing a full scrub of all the media in a filer within a month, and have it able to restart from the beginning and repeat the next month, etc.
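Back-of-envelope, using the defaults mentioned earlier in this thread: 6 days x 4 hours plus 12 hours on Sunday gives you roughly 36 hours of scrub window per week, or about 150 hours a month. If a full pass over your largest RAID group needs more than that at perf_impact low, you can't hit a monthly-pass target on the default schedule, and that's the signal to widen the windows.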
On large production filers, I’ve changed it from the default of a couple of hours once a week to running for several early-AM hours every day; on DR filers that are mostly only doing snapmirrors, I tend to be more aggressive - give it 8-12 hours each day and maybe raise the scrub priority, as long as it doesn’t impact snapmirror update times (and it usually doesn’t).
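As a sketch of what I mean, in the duration@weekday@start_hour syntax that raid.scrub.schedule takes (the numbers are illustrative, not a recommendation for anyone's particular boxes):

production: 4h@mon@1,4h@tue@1,4h@wed@1,4h@thu@1,4h@fri@1,4h@sat@1,4h@sun@1
DR: 10h@mon@0,10h@tue@0,10h@wed@0,10h@thu@0,10h@fri@0,10h@sat@0,10h@sun@0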
I’ve had another brand of disk array with so much horsepower in the controllers that it would run continuous scans of every RAID group, with enough smarts to immediately give control of the disks over to live I/O and pick up again when things went idle; that may not be necessary here, but it gave decent peace of mind.
I have seen these scrubs pick up and repair errors before, but I haven’t checked logs to see how often it happens nowadays; with 10/12/14 TB drives I’d expect it to happen more often, but I don’t know how true that is.
Someone let me know if any of my takes are incorrect, but I definitely don’t see any harm in expanding the schedule so that each bit gets scrubbed more often.
-dalvenjah
So, it looks like the best approach is to use the "storage raid-options" command to manipulate scrubbing in ONTAP. Here are some of the options (from 9.6; YMMV with versions before ONTAP 9.6!):
*raid.media_scrub.enable*
This option enables/disables continuous background media scrubs for all the aggregates in the system. Valid values are on and off. The default value is on. When enabled, a low-overhead version of scrub that checks only for media errors runs continuously on all aggregates in the system. Background media scrub has a negligible performance impact on the user workload and this is achieved by aggressive disk and CPU throttling.
*raid.media_scrub.rate*
This option sets the rate of media scrub on an aggregate. Valid values for this option range from 300 to 3000, where a rate of 300 represents a media scrub of approximately 512 MB per hour, and 3000 represents a media scrub of approximately 5 GB per hour. The default value for this option is 600, which is a rate of approximately 1 GB per hour.
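(For scale, and assuming that rate applies per disk: at the default of roughly 1 GB per hour, a 4 TB drive takes on the order of 4,000 hours, i.e. several months, to cover completely, which is presumably why this scrub runs continuously in the background rather than in a scheduled window.)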
*raid.scrub.duration*
This option sets the duration of automatically started scrubs, in minutes. If this is not set, or is set to 0, the default duration is 4 hours (240 minutes). If set to -1, all automatic scrubs run to completion.
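(Note that this 240-minute default is exactly the "scrub time limit 240" in the log message from the original post. If the goal is simply to stop seeing those suspensions, setting this option to -1, or scheduling longer windows, should presumably do it, at the cost of scrub I/O running for longer stretches.)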
*raid.scrub.enable*
This option enables/disables the RAID scrub feature. Valid values are on or off. The default value is on. This option only affects the scrubbing process that gets started from cron. This option is ignored for user-requested scrubs.
*raid.scrub.perf_impact*
This option sets the overall performance impact of RAID scrubbing (whether started automatically or manually). When the CPU and disk bandwidth are not consumed by serving clients, scrubbing consumes as much bandwidth as it needs. If the serving of clients is already consuming most or all of the CPU and disk bandwidth, this option allows control over the CPU and disk bandwidth that can be taken away for scrubbing, and thereby enables control over the negative performance impact on the serving of clients. As the value of this option is increased, the speed of scrubbing also increases. The possible values for this option are low, medium, and high. The default value is low. When scrub and mirror verify are running at the same time, the system does not distinguish between their separate resource consumption on shared resources (like CPU or a shared disk). In this case, the combined resource utilization of these operations is limited to the maximum resource entitlement for individual operations.
*raid.scrub.schedule*
This option specifies the weekly schedule (day, time, and duration) for scrubs started automatically by the raid.scrub.enable option. On a non-AFF system, the default schedule is daily at 1 a.m. for a duration of 4 hours, except on Sunday when it is 12 hours. On an AFF system, the default schedule is weekly at 1 a.m. on Sunday for a duration of 6 hours. If an empty string ("") is specified as an argument, it deletes the previous scrub schedule and restores the default schedule. One or more schedules can be specified using this option. The syntax is duration[h|m]@weekday@start_time[,duration[h|m]@weekday@start_time,...] where duration is the time period for which the scrub operation is allowed to run, in hours or minutes ('h' or 'm' respectively). If duration is not specified, the raid.scrub.duration option value is used as the duration for the schedule.
weekday is the day on which the scrub is scheduled to start. The valid values are sun, mon, tue, wed, thu, fri, sat.
start_time is the time when the scrub is scheduled to start, specified in 24-hour format. Only the hour (0-23) needs to be specified.
For example, options raid.scrub.schedule 240m@tue@2,8h@sat@22 will cause scrub to start on every Tuesday at 2 a.m. for 240 minutes, and on every Saturday at 10 p.m. for 480 minutes.
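In clustered ONTAP you would set that per node, along these lines (double-check the exact flags against your version's man page; this is the general shape, not a guaranteed incantation):

fc1-ev::> storage raid-options modify -node fc1-n1 -name raid.scrub.schedule -value "240m@tue@2,8h@sat@22"

and, per the description above, passing an empty string ("") as the value should drop the custom schedule and restore the default.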
--tmac
Tim McCarthy, Principal Consultant
Proud Member of the #NetAppATeam: https://twitter.com/NetAppATeam
I Blog at TMACsRack: https://tmacsrack.wordpress.com/
On Fri, 31 May 2019, Dalvenjah FoxFire wrote:
So my take has always been to aim for doing a full scrub of all the media in a filer within a month, and have it able to restart from the beginning and repeat the next month, etc.
Thanks. That's at least a data point I can use.