On 05/05/99 13:01:00 you wrote:
On Thursday I called back (having had second thoughts, but also having been out of town), and asked for further clarification. In particular:
(1) How can there be a parity inconsistency w/o some sort of disk error?
Short of a software error in calculating parity or a similar write problem, there probably isn't a way.
If there was a hard error, why is there no log entry for same? If there was no error, then how can the parity be wrong?
A variety of ways. Three immediately spring to mind:
1. The block with the error did not go bad until after data had been written to it, but before data was read from it.
2. The drive reported the write as successful, but the data was written incorrectly.
3. The drive reported the read as successful, but returned incorrect data.
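To make these failure modes concrete, here is a small sketch in Python (purely illustrative; the byte-wise XOR and tiny 8-byte blocks are stand-ins for the filer's real RAID-4 layout) of how a block that goes bad silently leaves the stripe's parity inconsistent with nothing in the log:

    from functools import reduce

    def xor_parity(blocks):
        # RAID-4 style parity: byte-wise XOR across all data blocks.
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    # One stripe of three data blocks; parity is computed at write time.
    stripe = [b"\x00" * 8, b"\xaa" * 8, b"\x0f" * 8]
    parity = xor_parity(stripe)

    # A block silently decays (or was written/read wrong, cases 1-3):
    # the drive reports no error, so nothing is logged, but the stripe
    # no longer checks out when the scrub recomputes the parity.
    stripe[1] = b"\xab" * 8
    print(xor_parity(stripe) == parity)   # False: inconsistency, no disk error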
Disk_stat reports that since the filer was booted (2 months ago) there have been 3 recovered errors. However, I would not expect a recovered error to produce a parity inconsistency, since the correct data should have been recovered.
I've usually seen parity inconsistencies as a result of an improper shutdown. However, these are repaired immediately upon reboot.
(2) How does the system know that the error is in the parity block and not in a data block? The response that we received from NetApp support indicated that it may in fact have rewritten a data block, but logged it as a parity block.
I don't think it actually rewrote a data block (though I could be wrong). Rather, the scrub read all the data, found the parity block inconsistent with it, and rewrote the parity information. I'm not sure there is any way to know for certain that the parity really was the bad block, rather than one of the data blocks.
In either case, how can they determine where the error is if a parity inconsistency occurs w/o a corresponding media error?
I think you're basically saying the same thing here; the problem could be in the data block, not the parity block. In any case, assuming the parity is what's bad, they know which block on that stripe holds the parity, so they know which block to rewrite.
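Here's what I understand the scrub's logic to be, as a sketch (reusing the xor_parity helper from the earlier example; this mirrors the behavior as described, not NetApp's actual code). The parity block's position in the stripe is fixed, so when the recomputed parity disagrees with the stored one and no media error points at a culprit, the scrub just rewrites the parity:

    def scrub_stripe(data_blocks, parity_block):
        # Recompute what the parity should be from the data blocks.
        expected = xor_parity(data_blocks)
        if expected == parity_block:
            return parity_block, False   # stripe is consistent
        # Mismatch with no media error: the bad block cannot be
        # localized, so assume the data is right and rewrite parity.
        return expected, True            # "rewritten" parity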
(3) How can we (as end users) map the stripe number to disk blocks, and then further to the data and/or filesystem info (Inode) blocks.
I don't think there's any way without an expensive scan of various metadata files. However, it would be nice if every such disk message contained such information.
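If you were willing to pay for that scan, it might look roughly like this (hypothetical throughout: the stripe-to-block arithmetic and the inode_table shape are guesses at a simple layout, not WAFL's actual metadata format):

    def stripe_to_blocks(stripe_no, data_disks):
        # Guessed layout: stripe k covers volume block numbers
        # [k * data_disks, (k + 1) * data_disks).
        return range(stripe_no * data_disks, (stripe_no + 1) * data_disks)

    def inodes_touching_stripe(stripe_no, data_disks, inode_table):
        # The expensive part: walk every inode's block pointers looking
        # for any block that falls within the affected stripe.
        wanted = set(stripe_to_blocks(stripe_no, data_disks))
        return [ino for ino, blocks in inode_table.items()
                if wanted & set(blocks)]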
(4) Short of wacky, how can we be really sure of our filesystem/data integrity?
No way, but if the filesystem is fine then wacky won't tell you the data is corrupt either. If a block of zeroes got all flipped to ones, and the parity was rewritten to accommodate that, there's no way to know it happened unless you independently know what that block in a given file should contain.
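Continuing the earlier sketch (again reusing xor_parity), this is why no later check can catch it: once the parity has been rewritten to match the corrupted data, the stripe is self-consistent again:

    # Block 0 should be all zeroes but has silently flipped to ones.
    stripe = [b"\xff" * 8, b"\xaa" * 8, b"\x0f" * 8]
    parity = xor_parity(stripe)   # the scrub "repairs" parity to match

    # Every subsequent consistency check now passes; the corruption is
    # invisible unless you independently know block 0 ought to be zeroes.
    assert xor_parity(stripe) == parity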
What I got back was (in essence) "don't worry, everything is fine," some canned (RAID for dummies) answers about why RAID scrubbing is run, and a brief synopsis of the error messages. Except for items 3 and 4...
(3) You can't, not really. I believe this and really didn't expect any other answer; it was just wishful thinking.
(4) Use wacky, and don't forget you have to take the filer out of service and it can take up to 10 hours to check the filesystem [for large values of filesystem]. Or upgrade to 5.3 and it only takes about an hour.
Could someone give me a pointer to a document that answers questions 1 and 2? Or a brief answer to the group? Does anyone else get these messages?
Yep. The bottom line is: don't worry about it, as they are probably of no consequence. If you continue to get inconsistent parity, it could be a pointer to some other problem.
Bruce