Hello,
There have been some questions of late regarding the meaning of soft or recovered errors.
A recovered error is defined as the disk needing to take some action other than the initial access to the media in order to satisfy an i/o request. The methods used may include retrying the request, reseeking to the sector, recreating the data from error correction information, or all of the above.
Recovered errors do not indicate pending disk failure. Drive manufacturers specify a bit error rate of so many recoverable errors per bits of access. Note that bits of access includes sector overhead. A 512 byte sector is bracketed by header and trailer information used to keep track of the location as well as error recovery information.
While recovered errors do not in themselves indicate pending drive failure, an excessive rate of more than 10 a day on a filer under average load would indicate problems in media management.
This condition must be examined more closely to really quantify the error rate. A single sector written with a bit flip will always return a recovered event when read. Naturally. The ECC does not match the stored data so the error correction is invoked to reproduce the proper image. So when examining the recovered error log look at the block address. If each entry references the same block, that is ONE recovered error, regardless of how many entries are made against it.
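As a rough sketch of that counting rule, the few lines of Python below collapse repeated log entries against the same block into a single recovered error. The (disk, block address) tuples and the sample values are hypothetical, made up for illustration. They are not the actual filer log format, so the entries would have to be pulled out of the real log by hand or by a parser of your own.

```python
from collections import Counter

def distinct_recovered_errors(entries):
    """Count log entries per distinct (disk, block_address) pair.

    entries is an iterable of (disk, block_address) tuples. This is a
    hypothetical, already-parsed form of the log, not the real format.
    """
    return Counter(entries)

# Made-up sample: three entries against one block, one against another.
log = [("disk0", 1052), ("disk0", 1052), ("disk0", 1052), ("disk0", 88210)]

per_block = distinct_recovered_errors(log)
print(len(log), "log entries,", len(per_block), "distinct recovered errors")
for (disk, block), hits in per_block.items():
    print(disk, "block", block, "logged", hits, "time(s)")
```

The number of distinct keys, not the number of log lines, is the figure to compare against the expected rate.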
If each one indicates a different address, then the amount of traffic must be considered. Use sysstat to look at the megabyte per second rating of the system. From that, determine the megabyte per hour rating. Convert to bytes and multiply by 8 to get bits.
A reasonable number for the recovered error rate is 10 per 10^12 bits of access. So consider a disk giving an average response of 20 ms per 4K i/o. It's serving 50 transactions per second, or .2 MB/sec. In a 24 hour period, then, it serves about 1.4 * 10^11 bits. At 10 per 10^12 bits, that works out to a little over one expected recovered error per day. So a recovered error every day would not even be excessive. Naturally, a filer serving more transactions per second would have a higher rate of expected recovered errors.
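The arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration, a megabyte is assumed to be 10^6 bytes, and sector overhead is left out, so the result slightly understates the true bits of access.

```python
BITS_PER_BYTE = 8
BYTES_PER_MB = 10**6  # assumption: decimal megabytes

def bits_accessed(mb_per_sec, hours=24.0):
    """Bits transferred over the period at a steady MB/sec rate.

    Sector overhead (header/trailer) is ignored in this sketch.
    """
    return mb_per_sec * BYTES_PER_MB * BITS_PER_BYTE * hours * 3600

def expected_recovered_errors(mb_per_sec, hours=24.0, errors=10, per_bits=10**12):
    """Expected recovered errors at a rate of `errors` per `per_bits` bits."""
    return bits_accessed(mb_per_sec, hours) * errors / per_bits

# The example from the text: 50 transactions/sec at 4K each, about .2 MB/sec.
bits = bits_accessed(0.2)
expected = expected_recovered_errors(0.2)
print(f"{bits:.2e} bits per day, about {expected:.1f} expected recovered errors")
```

For .2 MB/sec this comes out to roughly 1.4 * 10^11 bits and a little over one expected recovered error per day. Plug in the MB/sec figure from sysstat to get the number for a busier filer.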
Realistically, disks don't perform that poorly and may go for long periods of time without reporting a single recovery event.
Our primary motivation in logging these events is more for performance. Since error recovery takes time, a disk having a large number of recovered errors could adversely affect performance. So we log these as a means to aid in diagnosing performance problems.
The short story is: drives are expected to have these recovered errors. Recovered errors don't indicate a pending failure. Removing a disk because of soft errors wastes time and materials. Further, it exposes you to real damage should some other device experience an unrecoverable error.
Tony -------------------------- mailto: taiello@netapp.com Ph:(408) 367-3251 Fax: (408) 367-3151