We've had a 220 for 30 months and a 330 and a 540 for about 18 months. In that time we've never had a raid scrub problem.
On April 25 we got 3 raid scrub error messages. I called netapp (that morning) and the CE said that there was nothing to worry about unless they continue and/or there are hard disk errors associated with the problem. I expected that this was the case, and accepted this response.
Please note that everything seems to be just fine, and I'm probably being a nervous Nelly. The error didn't recur, even though we ran scrub twice on the machine during the week.
On Thursday I called back (having had second thoughts, but also having been out of town), and asked for further clarification. In particular:
(1) How can there be a parity inconsistency w/o some sort of disk error? If there was a hard error, why is there no log entry for it? If there was no error, how can the parity be wrong? Disk_stat reports that since the filer was booted (2 months ago) there have been 3 recovered errors. However, I would not expect a recovered error to produce a parity inconsistency, since the correct data should have been recovered.
(2) How does the system know that the error is in the parity block and not in a data block? The response we received from netapp support indicated that it may in fact have rewritten a data block, but logged it as a parity block. In either case, how can they determine where the error is if a parity inconsistency occurs w/o a corresponding media error? (My rough mental model of this is sketched just after this list.)
(3) How can we (as end users) map the stripe number to disk blocks, and then further to the data and/or filesystem info (inode) blocks?
(4) Short of wacky, how can we really be sure of our filesystem/data integrity?
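For what it's worth, my rough mental model of what the scrub is doing (and why it can only ever report and rewrite the *parity* block) is the generic single-parity sketch below. This is just my guess, written in Python for illustration; it is not NetApp's code, and the block sizes, names, and layout are all made up.

from functools import reduce

# Generic single-parity (RAID-4 style) scrub check -- NOT NetApp's actual
# ONTAP code; the stripe layout and names here are made up for illustration.

def xor_blocks(blocks):
    # XOR a list of equal-sized byte strings together.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def scrub_stripe(data_blocks, parity_block):
    # Recompute parity from the data blocks and compare with what's on disk.
    # If they differ, and every read succeeded (so no media error), the scrub
    # only knows the stripe is internally inconsistent.  With a single parity
    # block it cannot tell WHICH block is stale, so all it can safely do is
    # rewrite the parity so the stripe is consistent again.
    expected = xor_blocks(data_blocks)
    if expected != parity_block:
        return expected      # -> "Rewriting bad parity block ..."
    return None              # stripe is consistent

# Tiny worked example: 3 data blocks; one gets rewritten without the parity
# update making it to disk (e.g. an interrupted write), so a later scrub
# finds a parity inconsistency with zero media errors.
d = [bytes([1, 2]), bytes([4, 8]), bytes([16, 32])]
p = xor_blocks(d)            # parity as originally written
d[1] = bytes([4, 9])         # data block changes, parity never updated
print(scrub_stripe(d, p))    # mismatch -> scrub recomputes and rewrites parity

If that picture is anywhere near right, it would also bear on question 2: with a single parity block the system can't actually know whether it was the parity or one of the data blocks that was stale.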
What I got back was (in essence) "don't worry, everything is fine" and some canned (RAID for dummies) answers about why raid scrubbing is run, plus a brief synopsis of the error messages. Except for items 3 and 4...
(3) You can't, not really. I believe this and really didn't expect any other answer, just wishful thinking.
(4) Use wacky, and don't forget you have to take the filer out of service and it can take up to 10 hours to check the filesystem [for large values of filesystem]. Or upgrade to 5.3 and it only takes about an hour.
Could someone give me a pointer to a document that answers questions 1 and 2? Or a brief answer to the group? Does anyone else get these messages?
Log file entries:

Sun Apr 25 01:00:00 PDT [raid_scrub_admin]: Beginning disk scrubbing...
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606202.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606202.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606203.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606203.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #606204.
Sun Apr 25 02:39:11 PDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #606204.
Sun Apr 25 03:48:07 PDT [consumer]: Scrub found 3 parity inconsistencies
Sun Apr 25 03:48:07 PDT [consumer]: Scrub found 0 media errors
Sun Apr 25 03:48:07 PDT [consumer]: Disk scrubbing finished...

-----
Stephen C. Woods; UCLA SEASnet; 2567 Boelter Hall; LA CA 90095; (310)-825-8614
Finger for public key: scw@cirrus.seas.ucla.edu, Internet mail: scw@SEAS.UCLA.EDU
On Wed, 5 May 1999 scw@seas.ucla.edu wrote:
> On April 25 we got 3 raid scrub error messages. I called netapp (that morning) and the CE said that there was nothing to worry about unless they continue and/or there are hard disk errors associated with the problem. I expected that this was the case, and accepted this response.

> What I got back was (in essence) "don't worry, everything is fine" and some canned (RAID for dummies) answers about why raid scrubbing is run, plus a brief synopsis of the error messages.
My story is different. NAC found some spurious errors like yours in one of my filer's logs. They called me and said they wanted to run some diags using DOT 5.2.1D7. I replied that indeed there were some errors, but they went away after I upgraded to 5.2.1P2. We came to the conclusion that they should come over and run the diags anyway. They will do this on May 16 if the planets align correctly.
Please let me know if you find out more about this phenomenon. I think it is a bug of some sort in the older version of the software rather than a hardware error. Here are the errors from my messages:
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256159.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256159.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256160.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256160.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256161.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256161.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256162.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256162.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 0, stripe #256163.
messages.2:Sun Apr 11 01:59:01 CDT [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 0, stripe #256163.
messages.2:Sun Apr 11 04:59:40 CDT [consumer]: Scrub found 8 parity inconsistencies
Tom