How does the Netapp know if there is bad data on reads then? Does
it rely on the drive to signal bit errors?
We rely on the disk's built-in data checking to tell us when something went wrong.
This is why you can get away with a simple parity approach.
Even if we did do XOR calculations for every read, that wouldn't give us enough information to fix the problem. All we would know is that a given stripe was inconsistent. The parity information is only sufficient to fix the data if you also know which block to fix. (You could design RAID around a full ECC code, but that would create quite a bit more overhead.)
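To make that concrete, here's a rough sketch (plain Python, purely illustrative, nothing to do with the actual product) of what parity alone can and can't tell you:

    def parity(blocks):
        # parity is just the XOR of all the data blocks in the stripe
        p = 0
        for b in blocks:
            p ^= b
        return p

    stripe = [0b1010, 0b0110, 0b1111]   # three data blocks
    stored_parity = parity(stripe)

    stripe[1] ^= 0b0001                 # one block silently goes bad

    # the check can say "this stripe is inconsistent" ...
    print(parity(stripe) == stored_parity)   # False
    # ... but nothing here says *which* block (or the parity itself) is wrong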
The performance hit of doing RAID parity checking on every read would be astoundingly dismal -- much worse even than for writing. To write a single block in a stripe, you read the old contents of that block and the parity block, do some math, and then write both blocks back -- a total of 4 I/Os for the write. And that's true even if you've got 20 disks in your array. By contrast, to do checking on a READ with a 20-disk stripe, you would have to read a block from all 20 disks, for a total of 20 I/Os for the read. YOW!
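For anyone who wants to check the arithmetic, here's the same I/O counting as a sketch (again just illustrative Python; the block values and the 20-disk stripe width are made up for the example):

    # small write: read old data + old parity, compute, write new data + new parity
    old_data, new_data = 0b1010, 0b0011
    untouched = [0b0110, 0b1111]                      # disks we never have to touch
    old_parity = old_data ^ untouched[0] ^ untouched[1]

    new_parity = old_parity ^ old_data ^ new_data     # the "some math" step
    assert new_parity == new_data ^ untouched[0] ^ untouched[1]

    write_ios = 2 + 2      # 2 reads + 2 writes, no matter how wide the stripe is
    read_check_ios = 20    # verifying a read on a 20-disk stripe touches every disk
    print(write_ios, read_check_ios)                  # 4 20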
And that doesn't even take into account the fact that for writes, we can use WAFL's write-anywhere cleverness to avoid seeks, and write multiple blocks in a stripe to reduce that 4-to-1 penalty. Reads tend to come in randomly, so the pain is harder to reduce.
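A quick back-of-the-envelope on why writing multiple blocks per stripe helps (the 19-data-plus-1-parity layout here is my own assumption for the example, not a statement about any particular config): once you can fill a whole stripe, parity is computed from data already in memory and written once, with no reads at all.

    data_disks = 19                      # say, 19 data + 1 parity in a 20-disk stripe
    full_stripe_ios = data_disks + 1     # 19 data writes + 1 parity write, zero reads
    per_block_full = full_stripe_ios / data_disks    # about 1.05 I/Os per block written
    per_block_small = 4                              # the lone-block read-modify-write case
    print(per_block_full, per_block_small)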
Dave