From the looks of your syslog messages, it appears that you are running 6.x. The good news here is that when the time expires under 6.x, the disk scrub picks up at that point on the next disk scrub. So you may not be scrubbing every sector every week, but you will be scrubbing every sector every 2 or 3 weeks.
Just some info to add to your options.
Hope this helps.
--
Adam Fox
NetApp Professional Services, NC
adamfox@netapp.com
-----Original Message-----
From: Hannes Herret [mailto:hh@bacher.at]
Sent: Tuesday, July 17, 2001 4:48 AM
To: toasters@mathworks.com
Cc: its-storage@bacher.at
Subject: disk scrub time
hi all,
normally the disk scrub starts on sunday at 1 a.m.; because of the ndmp-backup running at the same time, the filer couldn't finish the scrub within the default deadline time of 6 h.
there are 4 volumes on the filer and just 1 finished the scrub. the others got stopped like the following:
Sun Jul 8 01:00:00 CES [fsup02: raid_scrub_admin:notice]: Beginning disk scrubbing...
Sun Jul 8 07:00:01 CES [fsup02: consumer:notice]: Disk scrub stopped because the scrub time limit was exceeded.
Sun Jul 8 07:00:01 CES [fsup02: consumer:notice]: Scrub for volume vol0, RAID group 0 stopped at stripe 13435506.
Sun Jul 8 07:00:01 CES [fsup02: consumer:notice]: 0 new parity inconsistencies found, 0 to date.
Sun Jul 8 07:00:01 CES [fsup02: consumer:notice]: 0 new checksum errors found, 0 to date.
how to resolve this?
a. disable the automatic disk scrub and start it with cron via rsh - i don't like it, because of the extra administration
b. set the day of the disk scrub to e.g. saturday - how do i set that on the filer?
c. lengthen the scrub time limit with raid.scrub.duration - i think the default is 0 and the setting works in minutes. does 0 mean unlimited time? does 480 mean 8 hours?
d. optimize scrubbing with raid.scrubbers - the default is 2. is that the number of processes ontap should use for scrubbing?
a or b or c or d, or all of them?
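just to make it concrete, something like this is what i mean for (a), (c) and (d) - the commands go via rsh from an admin host (direct console works too), the filer name fsup02 is taken from the syslog, the values are only examples, and raid.scrub.enable is my assumption for how to switch the automatic scrub off:

   rsh fsup02 options raid.scrub.duration 480   # (c) allow 8 hours instead of the 6 h default
   rsh fsup02 options raid.scrubbers 4          # (d) more parallel scrub processes (only useful with several raid groups)
   rsh fsup02 options raid.scrub.enable off     # (a) switch off the automatic sunday scrub ...
   # ... and schedule it yourself, e.g. a crontab entry on the admin host for saturday 01:00:
   0 1 * * 6  rsh fsup02 disk scrub start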
any other suggestions or comments are welcome !!
cu hannes
--
Hannes Herret  IT-Service / Storage     phone : +43 (1) 60 126-34
Bacher Systems EDV GmbH                 fax   : +43 (1) 60 126-555
Wienerbergstr. 11B                      mailto:hh@bacher.at
A-1101 Wien, Austria                    www   : http://www.bacher.at/
Europe
Adam.Fox@netapp.com writes:
From the looks of your syslog messages, it appears that you are running 6.x. The good news here is that when the time expires under 6.x, the disk scrub picks up at that point on the next disk scrub. So you may not be scrubbing every sector every week, but you will be scrubbing every sector every 2 or 3 weeks.
Does this sort of restarting from a checkpoint (presumably the information in the state.raid.scrub.* entries in /etc/registry) apply if the scrub was interrupted by a "disk scrub stop" command?
Hannes Herret hh@bacher.at originally wrote
normally the disk scrub starts on sunday at 1 a.m.; because of the ndmp-backup running at the same time, the filer couldn't finish the scrub within the default deadline time of 6 h.
and it seems to me that a sensible thing to do would be to interrupt the scrub if it hasn't finished by the time the backups start, rather than let them madly compete with each other for disk bandwidth.
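If the checkpoint is honoured, a crontab entry on an admin host along these lines would do it (the filer name is taken from the syslog; the time and the rsh access are only assumptions on my part):

    # stop the running scrub just before the backup window opens;
    # sunday 03:00 here is only a placeholder for whenever the backups actually start
    0 3 * * 0  rsh fsup02 disk scrub stop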
Chris Thompson                  University of Cambridge Computing Service,
Email: cet1@ucs.cam.ac.uk       New Museums Site, Cambridge CB2 3QG,
Phone: +44 1223 334715          United Kingdom.
Chris Thompson wrote:
Adam.Fox@netapp.com writes:
From the looks of your syslog messages, it appears that you are running 6.x. The good news here is that when the time expires under 6.x, the disk scrub picks up at that point on the next disk scrub. So you may not be scrubbing every sector every week, but you will be scrubbing every sector every 2 or 3 weeks.
Does this sort of restarting from a checkpoint (presumably the information in the state.raid.scrub.* entries in /etc/registry) apply if the scrub was interrupted by a "raid scrub stop" command?
hold on chris, i will verify it.
Hannes Herret hh@bacher.at originally wrote
normally the disk scrub starts on sunday at 1 a.m.; because of the ndmp-backup running at the same time, the filer couldn't finish the scrub within the default deadline time of 6 h.
and it seems to me that a sensible thing to do would be to interrupt the scrub if it hasn't finished by the time the backups start, rather than let them madly compete with each other for disk bandwidth.
that was misinformation, there is no ndmp-backup running on sunday!! it seems to be normal to take 10 hours to scrub 2 x 19 x 36GB disks.
but another question comes up: this config is an f820-cluster. how do the cluster nodes handle the scrubbing? does each node scrub both loops? maybe this is the reason for the long scrub time.
Chris Thompson                  University of Cambridge Computing Service,
Email: cet1@ucs.cam.ac.uk       New Museums Site, Cambridge CB2 3QG,
Phone: +44 1223 334715          United Kingdom.
hh@bacher.at (Hannes Herret) writes [...]
that was misinformation, there is no ndmp-backup running on sunday!! it seems to be normal to take 10 hours to scrub 2 x 19 x 36GB disks.
That does seem about right, for 38 x 36GB discs on a single FCAL loop. How are they divided up into volumes and RAID groups? The syslog'd messages report on each of these as they complete, and the timings may be suggestive. You can also use "sysconfig -r" while the scrub is running, and it will tell you which RAID groups are being done and what the progress on each is. It's possible for the 2 parallel scrub processes (2 is the default; "options raid.scrubbers" can alter it) to choose to do things in a non-optimal order, so that a single large RAID group gets scrubbed by itself at the end.
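Both of those can be checked remotely while the scrub is running (assuming rsh access to the filer is set up), e.g.:

    rsh fsup02 sysconfig -r              # per-RAID-group view, including scrub progress
    rsh fsup02 options raid.scrubbers    # current number of parallel scrub processes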
but another question comes up: this config is an f820-cluster. how do the cluster nodes handle the scrubbing? does each node scrub both loops? maybe this is the reason for the long scrub time.
Surely not!? Only online volumes are scrubbed, and only by the node currently controlling them.
When you say "2 x 19 x 36GB disks" is that meant to imply 19 x 36GB discs on each of the two F820s? In that case, 10 hours for a scrub sounds way too high, unless there has been a failover and one of them is struggling to scrub both sets of discs.
Chris Thompson                  University of Cambridge Computing Service,
Email: cet1@ucs.cam.ac.uk       New Museums Site, Cambridge CB2 3QG,
Phone: +44 1223 334715          United Kingdom.
Chris Thompson wrote:
hh@bacher.at (Hannes Herret) writes [...]
that was misinformation, there is no ndmp-backup running on sunday!! it seems to be normal to take 10 hours to scrub 2 x 19 x 36GB disks.
That does seem about right, for 38 x 36GB discs on a single FCAL loop. How are they divided up into volumes and RAID groups? The syslog'd messages report on each of these as they complete, and the timings may be suggestive. You can also use "sysconfig -r" while the scrub is running, and it will tell you which RAID groups are being done and what the progress on each is. It's possible for the 2 parallel scrub processes (2 is the default; "options raid.scrubbers" can alter it) to choose to do things in a non-optimal order, so that a single large RAID group gets scrubbed by itself at the end.
does that mean 1 scrubber per volume? are 2 volumes scrubbed at the same time, or is 1 volume scrubbed by 2 scrubbers - e.g. 1 scrubber beginning with the first block, the other in the middle of the disk?
but another question comes up: this config is an f820-cluster. how do the cluster nodes handle the scrubbing? does each node scrub both loops? maybe this is the reason for the long scrub time.
Surely not!? Only online volumes are scrubbed, and only by the node currently controlling them.
When you say "2 x 19 x 36GB disks" is that meant to imply 19 x 36GB discs on each of the two F820s? In that case, 10 hours for a scrub sounds way too high, unless there has been a failover and one of them is struggling to scrub both sets of discs.
yes, i mean a cluster configuration with 19 disks on each side. i agree - i also think 10 hours is too high for 19 disks.
Chris Thompson                  University of Cambridge Computing Service,
Email: cet1@ucs.cam.ac.uk       New Museums Site, Cambridge CB2 3QG,
Phone: +44 1223 334715          United Kingdom.
hh@bacher.at (Hannes Herret) writes
Chris Thompson wrote:
hh@bacher.at (Hannes Herret) writes [...]
that was misinformation, there is no ndmp-backup running on sunday!! it seems to be normal to take 10 hours to scrub 2 x 19 x 36GB disks.
That does seem about right, for 38 x 36GB discs on a single FCAL loop. How are they divided up into volumes and RAID groups? The syslog'd messages report on each of these as they complete, and the timings may be suggestive. You can also use "sysconfig -r" while the scrub is running, and it will tell you which RAID groups are being done and what the progress on each is. It's possible for the 2 parallel scrub processes (2 is the default; "options raid.scrubbers" can alter it) to choose to do things in a non-optimal order, so that a single large RAID group gets scrubbed by itself at the end.
[& later]
When you say "2 x 19 x 36GB disks" is that meant to imply 19 x 36GB discs on each of the two F820s? In that case, 10 hours for a scrub sounds way too high, unless there has been a failover and one of them is struggling to scrub both sets of discs.
does that mean 1 scrubber per volume? are 2 volumes scrubbed at the same time, or is 1 volume scrubbed by 2 scrubbers - e.g. 1 scrubber beginning with the first block, the other in the middle of the disk?
[& later]
yes, i mean a cluster configuration with 19 disks on each side. i agree - i also think 10 hours is too high for 19 disks.
Scrubbing is done on a per RAID group basis. My understanding is that each scrubbing process does one RAID group at a time: with the default value of "options raid.scrubbers 2" this means that two RAID groups (which might or might not be part of the same volume) can be processed in parallel.
You still haven't told us how your discs are allocated to volumes and RAID groups. If all 19 discs are a single RAID group in a single volume (that would be larger than the default raidsize) then only one scrubbing process would be active.
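Something like

    rsh fsup02 vol status -r    # lists each volume's RAID groups and member disks

(or "sysconfig -r" on the console) would show that layout; the filer name and the rsh access are, as before, just assumptions.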
Chris Thompson                  University of Cambridge Computing Service,
Email: cet1@ucs.cam.ac.uk       New Museums Site, Cambridge CB2 3QG,
Phone: +44 1223 334715          United Kingdom.
Adam.Fox@netapp.com wrote:
From the looks of your syslog messages, it appears that you are running 6.x. The good news here is that when the time expires under 6.x, the disk scrub picks up at that point on the next disk scrub. So you may not be scrubbing every sector every week, but you will be scrubbing every sector every 2 or 3 weeks.
and I asked:
<
< Does this sort of restarting from a checkpoint (presumably the
< information in the state.raid.scrub.* entries in /etc/registry)
< apply if the scrub was interrupted by a "disk scrub stop" command?
and Hannes Herret <hh@bacher.at> replied (off-list):
|
| yes. if you stop a manually initiated disk scrub, the last stripe is
| also logged to the registry.
This is certainly true, and the automatic scrub at 01:00 on Sunday will restart from a checkpoint created by "disk scrub stop" - I have tried it.
But there's an infelicity (bug?) in that "disk scrub start" always starts a full scrub from the beginning, and ignores any checkpoint left by "disk scrub stop" (or, presumably, by a scrub timeout, although I haven't tested that case). This means that if one wants to run a regular scrub at some time other than the default, one cannot get any advantage from the checkpointing.
This cannot be right, surely? At the very least, there ought to be a "disk scrub continue" option that does respect the checkpoint info.
All tests were done with DOT 6.1R1.
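For anyone who wants to look at the checkpoints themselves: /etc/registry is a plain text file, so with the filer's root volume mounted on an admin host it can simply be grepped (the mount point below is purely an example, and the key names are only what I observed, not anything documented):

    grep state.raid.scrub /mnt/fsup02/etc/registry    # stored scrub stripe positions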
Chris Thompson                  University of Cambridge Computing Service,
Email: cet1@ucs.cam.ac.uk       New Museums Site, Cambridge CB2 3QG,
Phone: +44 1223 334715          United Kingdom.