We've had a 220 for 30 months and a 330 and a 540 for about 18 months. In that time we've never had a raid scrub problem.
On April 25 we got 3 raid scrub error messages. I called netapp (that morning) and the CE said that there was nothing to worry about unless they continue and/or there are hard disk errors associated with the problem. I expected that this was the case, and accepted this response.
Please note that everything seems to be just fine, and I'm probably being a nervous Nelly. The error didn't recur, even though we ran scrub twice on the machine during the week.
On Thursday I called back (having had second thoughts, but also having been out of town), and asked for further clarification. In particular:
(1) How can there be a parity inconsistency w/o some sort of disk error? If there was a hard error, why is there no log entry for it? If there was no error, how can the parity be wrong? Disk_stat reports that since the filer was booted (2 months ago) there have been 3 recovered errors. However, I would not expect a recovered error to produce a parity inconsistency, since the correct data should have been recovered.
(2) How does the system know that the error is in the parity block and not in a data block? The response we received from netapp support indicated that it may in fact have rewritten a data block, but logged it as a parity block. In either case, how can they determine where the error is if a parity inconsistency occurs w/o a corresponding media error? (My rough mental model of this is sketched just after this list.)
(3) How can we (as end users) map the stripe number to disk blocks, and then further to the data and/or filesystem info (inode) blocks?
(4) Short of wacky, how can we really be sure of our filesystem/data integrity?
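For what it's worth, my rough mental model of what the scrub is doing (and why it can only ever report and rewrite the *parity* block) is the generic single-parity sketch below. This is just my guess, written in Python for illustration; it is not NetApp's code, and the block sizes, names, and layout are all made up.

from functools import reduce

# Generic single-parity (RAID-4 style) scrub check -- NOT NetApp's actual
# ONTAP code; the stripe layout and names here are made up for illustration.

def xor_blocks(blocks):
    # XOR a list of equal-sized byte strings together.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def scrub_stripe(data_blocks, parity_block):
    # Recompute parity from the data blocks and compare with what's on disk.
    # If they differ, and every read succeeded (so no media error), the scrub
    # only knows the stripe is internally inconsistent.  With a single parity
    # block it cannot tell WHICH block is stale, so all it can safely do is
    # rewrite the parity so the stripe is consistent again.
    expected = xor_blocks(data_blocks)
    if expected != parity_block:
        return expected      # -> "Rewriting bad parity block ..."
    return None              # stripe is consistent

# Tiny worked example: 3 data blocks; one gets rewritten without the parity
# update making it to disk (e.g. an interrupted write), so a later scrub
# finds a parity inconsistency with zero media errors.
d = [bytes([1, 2]), bytes([4, 8]), bytes([16, 32])]
p = xor_blocks(d)            # parity as originally written
d[1] = bytes([4, 9])         # data block changes, parity never updated
print(scrub_stripe(d, p))    # mismatch -> scrub recomputes and rewrites parity

If that picture is anywhere near right, it would also bear on question 2: with a single parity block the system can't actually know whether it was the parity or one of the data blocks that was stale.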
What I got back was (in essence) "don't worry, everything is fine" and some canned (RAID for dummies) answers about why raid scrubbing is run, plus a brief synopsis of the error messages. Except for items 3 and 4...
(3) You can't, not really. I believe this and really didn't expect any other answer, just wishful thinking.
(4) Use wacky, and don't forget you have to take the filer out of service and it can take up to 10 hours to check the filesystem [for large values of filesystem]. Or upgrade to 5.3 and it only takes about an hour.
Could someone give me a pointer to a document that answers questions 1 and 2? Or a brief answer to the group? Does anyone else get these messages?
Log file entries:

Sun Apr 25 01:00:00 PDT [raid_scrub_admin]: Beginning disk scrubbing...
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606202.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606202.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606203.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606203.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606204.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606204.
Sun Apr 25 03:48:07 PDT [consumer]: Scrub found 3 parity inconsistencies
Sun Apr 25 03:48:07 PDT [consumer]: Scrub found 0 media errors
Sun Apr 25 03:48:07 PDT [consumer]: Disk scrubbing finished...

-----
Stephen C. Woods; UCLA SEASnet; 2567 Boelter Hall; LA CA 90095; (310)-825-8614
Finger for public key: scw@cirrus.seas.ucla.edu, Internet mail: scw@SEAS.UCLA.EDU
On Wed, 5 May 1999 scw@seas.ucla.edu wrote:
> On April 25 we got 3 raid scrub error messages. I called netapp (that morning) and the CE said that there was nothing to worry about unless they continue and/or there are hard disk errors associated with the problem. I expected that this was the case, and accepted this response.

> What I got back was (in essence) "don't worry, everything is fine" and some canned (RAID for dummies) answers about why raid scrubbing is run, plus a brief synopsis of the error messages.
My story is different. NAC found some spurious errors like yours in one of my filer's logs. They called me and said they wanted to run some diags using DOT 5.2.1D7. I replied that indeed there were some errors, but they went away after I upgraded to 5.2.1P2. We came to the conclusion that they should come over and run the diags anyway. They will do this on May 16 if the planets align correctly.
Please let me know if you find out more about this phenomenon. I think it is a bug of some sort in the older version of the software rather than a hardware error. Here are the errors from my messages:
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256159.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256159.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256160.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256160.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256161.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256161.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256162.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256162.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256163.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256163.
messages.2:Sun Apr 11 04:59:40 CDT [consumer]: Scrub found 8 parity inconsistencies
Tom