Sorry folks, but I don't have a white paper reference. I got my information from a presentation by our NetApp representative. A big focus of his talk was the NearStore appliances, which have large numbers of large-capacity (but less reliable) disks.
NetApp realized that disk failures would happen more often on NearStores and that reconstruction times on a 250 GB drive would be very long, making the risk of a double drive failure unacceptable. That is why they recommend double-parity RAID groups on NearStores and why they try to speed up reconstruction by copying whatever data can still be read from the failed disk.
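To make the copy-first idea concrete, here is a minimal sketch in Python. It is not NetApp's code, just my reading of the approach: it assumes a simple single-parity (XOR) RAID group, models an unreadable block as None, and the names xor_blocks and rebuild_disk are made up for illustration.

```python
from functools import reduce
from typing import List, Optional, Tuple

Block = bytes


def xor_blocks(blocks: List[Block]) -> Block:
    """XOR a list of equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_disk(
    failing: List[Optional[Block]],
    survivors: List[List[Block]],
) -> Tuple[List[Block], int, int]:
    """Rebuild the contents of a failing disk onto a spare.

    failing[i] is the block read from the failing disk, or None if the
    read failed (hypothetical model of a read error).
    survivors holds every other disk in the group, parity disk included.
    Returns (rebuilt blocks, number copied, number reconstructed).
    """
    rebuilt: List[Block] = []
    copied = reconstructed = 0
    for i, block in enumerate(failing):
        if block is not None:
            # Cheap path: one read from the failing disk.
            rebuilt.append(block)
            copied += 1
        else:
            # Expensive path: read stripe i from every surviving disk.
            rebuilt.append(xor_blocks([disk[i] for disk in survivors]))
            reconstructed += 1
    return rebuilt, copied, reconstructed


# Two data disks plus parity; the second block of d1 is unreadable.
d0 = [b"\x01\x02", b"\x03\x04"]
d1 = [b"\x10\x20", b"\x30\x40"]
parity = [xor_blocks([a, b]) for a, b in zip(d0, d1)]
rebuilt, copied, reconstructed = rebuild_disk([d1[0], None], [d0, parity])
```

The saving comes from the cheap path: a copied block costs one read, while a reconstructed block costs a read from every other disk in the group.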
There is an excellent white paper, TR3298, that explains how double parity works.
Mark Simmons mds@gbnet.net writes:
James Brigman wrote:
Steve;
NetApp recently modified their disk reconstruction procedure to copy as much valid data as possible from the failed disk and only reconstruct blocks that cannot be read. Often a disk does not completely fail, so many blocks can be copied from it, which is much faster than reconstructing.
Can you please point us to a whitepaper on this?
[...]
I have to say that, if a disk is known to be failing, I'm not sure I'd want to be trusting the data one would copy from it...
Well, there is the "horizontal" data validation provided by the zone or block checksums to deal with that.
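As a rough illustration of that kind of check (a sketch only, not how ONTAP does it): the code below assumes each block has a checksum stored separately from the data, uses zlib.crc32 as a stand-in for whatever checksum the filer really computes, and the helper name trusted_copy is made up.

```python
import zlib
from typing import Optional


def checksum(block: bytes) -> int:
    """Stand-in checksum; the real zone/block checksum is an assumption here."""
    return zlib.crc32(block)


def trusted_copy(block: Optional[bytes], stored_checksum: int) -> Optional[bytes]:
    """Return the block only if it was read and matches its stored checksum.

    None means "do not trust this copy, reconstruct the block from the
    other disks instead".
    """
    if block is None:
        return None  # the read itself failed
    if checksum(block) != stored_checksum:
        return None  # the read succeeded but returned bad data
    return block


good = b"payload"
stored = checksum(good)
# Flip one bit to model a block the failing disk returns silently corrupted.
corrupted = bytes([good[0] ^ 0x01]) + good[1:]
```

A block copied from the failing disk would only be used if trusted_copy returns it; anything else falls back to reconstruction.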
Also, if it's failing in a way that makes it take a lot of time to serve a block of data, does DOT adjust its strategy accordingly and work out at some point that it should just start recreating the data from the other disks?
That seems like a good question, and I would echo James' request for a whitepaper (or other form of technical detail).
Is the same procedure used when a disk is failed by operator command as well as when ONTAP decides to fail it on its own initiative? The case of failing a disc because its reported error rate is too high for comfort would seem to be one of the most likely scenarios when "many blocks can be copied from it".
Also, if ONTAP is rebooted during the reconstruction, is the "half-failed" status of the disc preserved?
Chris Thompson Email: cet1@cam.ac.uk
Steve Losen scl@virginia.edu phone: 434-924-0640
University of Virginia ITC Unix Support