We've had a 220 for 30 months, and a 330 and a 540 for about 18 months. In all that time we've never had a RAID scrub problem.
On April 25 we got 3 RAID scrub error messages. I called NetApp that morning, and the CE said there was nothing to worry about unless the errors continue and/or there are hard disk errors associated with the problem. I suspected this was the case, and accepted the response.
Please note that everything seems to be just fine, and I'm probably being a nervous Nelly. The error didn't recur, even though we ran scrub twice on the machine during the week.
On Thursday I called back (having had second thoughts, but also having been out of town), and asked for further clarification. In particular:
(1) How can there be a parity inconsistency w/o some sort of disk error? If there was a hard error, why is there no log entry for it? If there was no error, then how can the parity be wrong? disk_stat reports that since the filer was booted (2 months ago) there have been 3 recovered errors. However, I would not expect a recovered error to produce a parity inconsistency, since the correct data should have been recovered.
(2) How does the system know that the error is in the parity block and not in a data block? The response we received from NetApp support indicated that it may in fact have rewritten a data block, but logged it as a parity block. Either way, how can they determine where the error is if a parity inconsistency occurs w/o a corresponding media error?
(3) How can we (as end users) map the stripe number to disk blocks, and then further to the data and/or filesystem info (inode) blocks?
(4) Short of wacky, how can we really be sure of our filesystem/data integrity?
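To make question (2) concrete, here's a toy sketch of the XOR arithmetic behind a single-parity scrub. This is just the generic RAID-4-style math, not NetApp's actual implementation: recomputing parity tells you a stripe is inconsistent, but XOR alone can't tell you which block in the stripe is the bad one.

```python
# Toy illustration of single-parity (RAID-4 style) stripe scrubbing.
# NOT NetApp's implementation -- just the underlying XOR arithmetic.

def parity_of(data_blocks):
    """XOR all data blocks in a stripe to get the expected parity block."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def stripe_is_consistent(data_blocks, parity_block):
    """A scrub recomputes parity and compares it to the stored parity."""
    return parity_of(data_blocks) == bytes(parity_block)

# A 3-data-disk stripe with 4-byte blocks (toy sizes).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00"]
stored_parity = parity_of(data)
assert stripe_is_consistent(data, stored_parity)

# Flip one bit in a data block: the recomputed parity no longer matches
# the stored parity, but the XOR check itself can't say whether a data
# block or the parity block is wrong -- absent a media error pointing at
# a specific disk, "rewrite the parity block" is an assumption.
corrupted = [bytearray(data[0]), data[1], data[2]]
corrupted[0][0] ^= 0x80
assert not stripe_is_consistent(corrupted, stored_parity)
```

That ambiguity is exactly what I'm asking about in (1) and (2): with no media error implicating a particular disk, the mismatch alone doesn't localize the fault.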
What I got back was, in essence, "don't worry, everything is fine," plus some canned (RAID for dummies) answers about why RAID scrubbing is run and a brief synopsis of the error messages. Except for items 3 and 4...
(3) You can't, not really. I believe this and didn't really expect any other answer; just wishful thinking.
(4) Use wacky, and don't forget you have to take the filer out of service, and it can take up to 10 hours to check the filesystem [for large values of filesystem]. Or upgrade to 5.3 and it only takes about an hour.
Could someone give me a pointer to a document that answers questions 1 and 2? Or a brief answer to the group? Does anyone else get these messages?
Log file entries:

Sun Apr 25 01:00:00 PDT [raid_scrub_admin]: Beginning disk scrubbing...
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606202.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606202.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606203.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606203.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606204.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606204.
Sun Apr 25 03:48:07 PDT [consumer]: Scrub found 3 parity inconsistencies
Sun Apr 25 03:48:07 PDT [consumer]: Scrub found 0 media errors
Sun Apr 25 03:48:07 PDT [consumer]: Disk scrubbing finished...

-----
Stephen C. Woods; UCLA SEASnet; 2567 Boelter hall; LA CA 90095; (310)-825-8614
Finger for public key scw@cirrus.seas.ucla.edu, Internet mail: scw@SEAS.UCLA.EDU