Hello NetApp Gurus. I'm mainly hoping for some guidance on RAID group scrubbing options and, with some luck, pointers to documentation that would help me determine appropriate values for our environment.
We've been seeing what feels like a large number of scrubbing "timeouts" on our filers, with log entries similar to this one:
May 28 05:00:02 fc1-ev-n1.console.private [kern.notice] [fc1-ev-n1:raid.scrub.suspended.timer:notice]: Disk scrub suspended because the scrub time limit 240 was exceeded. It will resume at the next weekly/scheduled scrub.
What we'd like to know is how concerned we should be about this. Ideally we'd be seeing "scrubs" complete reasonably frequently, but I'm honestly not sure how I could determine how frequently that happens, or how frequently it *should* happen for that matter.
The scrub options on the filer have not been changed from the ONTAP defaults (NetApp Release 9.5P2, but we've been seeing this with earlier versions as well):
Node    Option                  Value  Constraint
------- ----------------------- ------ ----------
fc1-n1  raid.media_scrub.rate   600    only_one
fc1-n1  raid.scrub.perf_impact  low    only_one
fc1-n1  raid.scrub.schedule            none
(and the same for the partner node, of course)
The "storage raid-options" manual page indicates that the default schedule of daily at 1am for 4 hours, except Sundays when it runs for 12 hours, applies if no explicit schedule is defined.
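As a sanity check on that default, here is a small Python sketch of the windows it implies. The constants simply restate the manual page as I read it (01:00 start, 4 hours on most days, 12 hours on Sundays); the function name `scrub_window` is made up for illustration and nothing here queries the filer. The 240-minute limit in the log entry corresponds to a 01:00 start and a 05:00 cutoff, which matches the 05:00:02 timestamp.

```python
from datetime import datetime, timedelta

# Assumed defaults, restated from the manual page description.
DEFAULT_START_HOUR = 1
WEEKDAY_LIMIT_MIN = 240      # 4 hours, same value as "scrub time limit 240"
SUNDAY_LIMIT_MIN = 12 * 60   # 12 hours

def scrub_window(day: datetime):
    """Return (start, end) of the assumed default scrub window for a date."""
    start = day.replace(hour=DEFAULT_START_HOUR, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    limit = SUNDAY_LIMIT_MIN if day.weekday() == 6 else WEEKDAY_LIMIT_MIN
    return start, start + timedelta(minutes=limit)

# Tuesday May 28 2019, the date of the log entry above.
start, end = scrub_window(datetime(2019, 5, 28))
print(start.time(), end.time())  # 01:00:00 05:00:00

# Total scrub time available per week under the assumed default schedule.
weekly_hours = (6 * WEEKDAY_LIMIT_MIN + SUNDAY_LIMIT_MIN) / 60
print(weekly_hours)  # 36.0
```

So under these assumptions each node gets at most 36 hours of scrubbing per week.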
If I examine the scrub status of our aggregates:
fc1-ev::> storage aggregate scrub -aggregate * -action status
Raid Group:/e1n2_tssd/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:39:10 2019
Raid Group:/e1n2_t01/plex0/rg0, Is Suspended:true, Last Scrub:Sun May 26 02:55:15 2019, Percentage Completed:65%
Raid Group:/e1n2_t02/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:28:37 2019
Raid Group:/e1n2_root/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 03:39:03 2019
Raid Group:/e1n1_root/plex0/rg0, Is Suspended:false, Last Scrub:Wed May 29 02:03:43 2019
Raid Group:/e1n1_d01/plex0/rg0, Is Suspended:true, Last Scrub:Tue May 28 03:22:52 2019, Percentage Completed:7%
Raid Group:/e1n1_d01/plex0/rg1, Is Suspended:true, Last Scrub:Wed May 29 02:00:56 2019, Percentage Completed:4%
Raid Group:/e1n1_d01/plex0/rg2, Is Suspended:false, Last Scrub:Wed May 29 04:02:50 2019
Raid Group:/e1n1_d01/plex0/rg3, Is Suspended:false, Last Scrub:Wed May 29 04:08:45 2019
Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
Raid Group:/e1n1_d00/plex0/rg1, Is Suspended:true, Last Scrub:Sun Apr 28 07:38:30 2019, Percentage Completed:80%
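To make this output easier to track over time, a rough Python sketch like the following could turn it into records. It assumes one RAID group per line in exactly the field layout shown above; the names `parse_status`, `LINE_RE`, and `STATUS_OUTPUT` are hypothetical, and the sample is just three lines copied from the output.

```python
import re
from datetime import datetime

# Three lines copied from the status output, one RAID group per line.
STATUS_OUTPUT = """\
Raid Group:/e1n2_t01/plex0/rg0, Is Suspended:true, Last Scrub:Sun May 26 02:55:15 2019, Percentage Completed:65%
Raid Group:/e1n2_t02/plex0/rg0, Is Suspended:false, Last Scrub:Thu May 30 02:28:37 2019
Raid Group:/e1n1_d00/plex0/rg0, Is Suspended:true, Last Scrub:Sun Apr 28 06:00:40 2019, Percentage Completed:81%
"""

# "Percentage Completed" only appears on some lines, so it is optional.
LINE_RE = re.compile(
    r"Raid Group:(?P<group>[^,\s]+), Is Suspended:(?P<suspended>true|false), "
    r"Last Scrub:(?P<last>[^,]+?)(?:, Percentage Completed:(?P<pct>\d+)%)?$"
)

def parse_status(text):
    """Parse scrub status lines into a list of dicts; skip unrecognised lines."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        records.append({
            "group": m["group"],
            "suspended": m["suspended"] == "true",
            # Timestamp format as printed above, e.g. "Sun May 26 02:55:15 2019".
            "last_scrub": datetime.strptime(m["last"], "%a %b %d %H:%M:%S %Y"),
            "percent": int(m["pct"]) if m["pct"] else None,
        })
    return records

# List only the suspended groups with their progress.
for rec in parse_status(STATUS_OUTPUT):
    if rec["suspended"]:
        print(rec["group"], rec["last_scrub"].date(), f'{rec["percent"]}%')
```

This only restructures what the command prints; it does not answer what "Last Scrub" means for a suspended group.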
The truth is I'm not sure how to interpret this output:
- Is it the case that each RAID group where "Is Suspended:false" *completed* its scrub at the "Last Scrub" time, while those that are suspended are those for which we're seeing log entries?
- Given the default schedule that has the scrub run for 12 hours on Sunday mornings, does it seem odd that /e1n2_t01/plex0/rg0 was suspended last Sunday at 02:55:15, prior to completion? In fact, all those interrupted on a Sunday were interrupted well before 12 hours. Might there be other reasons for suspending scrub operations? The load on this filer is not excessive in any way: CPU utilization is typically comfortably below 50%.
- How do I determine why the two RAID groups in e1n1_d00 haven't run scrubbing in over a month? Is there something I should do about that?
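To put a number on "over a month", here is a minimal Python sketch that computes the age of the last scrub for a few of the groups. The timestamps are copied from the status output; the reference time is an assumption (the latest "Last Scrub" shown, on the guess that the command was run shortly afterwards), and the 7-day threshold and the name `stale_groups` are purely illustrative.

```python
from datetime import datetime

# Assumed "now": the most recent Last Scrub timestamp in the status output.
REFERENCE = datetime(2019, 5, 30, 3, 39, 3)

# Last Scrub values copied from the status output for the suspended groups.
LAST_SCRUB = {
    "/e1n2_t01/plex0/rg0": datetime(2019, 5, 26, 2, 55, 15),
    "/e1n1_d01/plex0/rg0": datetime(2019, 5, 28, 3, 22, 52),
    "/e1n1_d00/plex0/rg0": datetime(2019, 4, 28, 6, 0, 40),
    "/e1n1_d00/plex0/rg1": datetime(2019, 4, 28, 7, 38, 30),
}

def stale_groups(last_scrub, now, max_age_days=7):
    """Return {group: age in whole days} for groups older than max_age_days."""
    return {
        group: (now - ts).days
        for group, ts in last_scrub.items()
        if (now - ts).days > max_age_days
    }

print(stale_groups(LAST_SCRUB, REFERENCE))
# {'/e1n1_d00/plex0/rg0': 31, '/e1n1_d00/plex0/rg1': 31}
```

With these inputs only the two e1n1_d00 groups exceed the threshold, at 31 whole days each.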
I've found documentation that explains the options and how to change them, but none that explains how to decide whether I *should* change them, or how to determine what to change them to. My understanding is that raid.media_scrub.rate and raid.scrub.schedule could be used together to tune the scrubbing, but I'm quite unsure how to determine the best values for our filers. Any pointers to documentation that would help here would be hugely appreciated.
Thanks in advance ...