Hello,
There have been some questions of late regarding the meaning of soft, or
recovered, errors.
A recovered error is defined as the disk needing to take some action
other than the initial access to the media in order to satisfy an i/o
request. The methods used may be to retry the request, reseek to the
sector, recreate the data from error correction information, or all of
the above.
Recovered errors do not indicate pending disk failure. Drive
manufacturers specify a bit error rate of so many recoverable errors
per so many bits of access. Note that bits of access includes sector overhead. A
512 byte sector is bracketed by header and trailer information used to
keep track of the location as well as error recovery information.
While recovered errors do not in themselves indicate pending drive
failure, an excessive rate of more than 10 a day on a filer under
average load would indicate problems in media management.
This condition must be examined more closely to really quantify the
error rate. A single sector written with a bit flip will naturally
return a recovered event every time it is read: the ECC does not match
the stored data, so error correction is invoked to reproduce the proper
image.
So when examining the recovered error log, look at the block address. If
each entry references the same block, that is ONE recovered error,
regardless of how many entries are made against it.
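As a rough illustration of that counting rule, here is a minimal Python
sketch. The (disk, block address) pairs are a hypothetical pre-parsed form
of the log, not the filer's actual log format, and the function name is
made up for this example.

```python
from collections import Counter

def count_distinct_recovered_errors(entries):
    """Collapse repeated log entries against the same block.

    entries: iterable of (disk, block_address) pairs. This shape is an
    assumption for illustration, not the real log format.
    Returns (number of distinct blocks, Counter of entries per block).
    """
    per_block = Counter(entries)
    return len(per_block), per_block

# Hypothetical log: three entries against one block, one against another.
sample_log = [
    ("disk0", 1000),
    ("disk0", 1000),
    ("disk0", 1000),
    ("disk0", 2048),
]

distinct, per_block = count_distinct_recovered_errors(sample_log)
print(distinct)                    # distinct recovered errors
print(per_block[("disk0", 1000)])  # raw entries logged against one block
```

Four log entries here amount to only two recovered errors, since three of
them reference the same block.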
If each one indicates a different address, then the amount of traffic
must be considered. Use sysstat to look at the megabyte per second
rating of the system. From that, determine the megabyte per hour rating.
Multiply by 8 to get megabits, and by 10^6 to get bits.
A reasonable number for the recovered error rate is 10 per 10^12 bits
of access. So consider a disk giving an average response of 20 ms per 4K
i/o. It's serving 50 transactions per second, or 0.2 MB/sec. In a 24 hour
period, then, it serves about 1.4 * 10^11 bits. That is not far from the
10^12 bits over which 10 recovered errors are allowed, and works out to
roughly 1.4 expected recovered errors per day. So a recovered error every
day would not even be excessive. Naturally, a filer serving more
transactions per second would have a higher rate of expected recovered
errors.
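The arithmetic above can be sketched in a few lines of Python. This is
only an illustration: the function names are made up, "4K" is taken as
4000 bytes and a megabyte as 10^6 bytes so that the result matches the
rounded 0.2 MB/sec figure, and the 10 per 10^12 bits rate is the
"reasonable number" quoted above rather than any particular drive's
specification.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BITS_PER_BYTE = 8
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes

def bits_per_day(mb_per_sec):
    """Convert a sustained MB/sec figure into bits accessed in 24 hours."""
    return mb_per_sec * BYTES_PER_MB * BITS_PER_BYTE * SECONDS_PER_DAY

def expected_recovered_errors(bits, errors=10, per_bits=10**12):
    """Expected recovered errors for a given number of bits accessed."""
    return bits * errors / per_bits

# The example disk: 20 ms per 4K i/o.
ios_per_sec = 1000 / 20                           # 50 transactions per second
mb_per_sec = ios_per_sec * 4000 / BYTES_PER_MB    # about 0.2 MB/sec

daily_bits = bits_per_day(mb_per_sec)             # about 1.4 * 10^11 bits
daily_errors = expected_recovered_errors(daily_bits)

print(f"{daily_bits:.2e} bits per day")
print(f"{daily_errors:.2f} expected recovered errors per day")
```

To apply this to a real system, substitute the MB/sec figure observed in
sysstat for mb_per_sec.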
Realistically, disks don't perform that poorly and may go for long
periods of time without reporting a single recovery event.
Our primary motivation in logging these events is more for performance.
Since error recovery takes time, a disk having a large number of
recovered errors could adversely affect performance. So we log these as
a means to aid in diagnosing performance problems.
The short story is: drives are expected to have these recovered errors.
Recovered errors don't indicate a pending failure. Removing a disk
because of soft errors wastes time and materials. Further, it exposes you
to real damage should some other device experience an unrecoverable
error.
Tony
--------------------------
mailto: taiello(a)netapp.com
Ph:(408) 367-3251
Fax: (408) 367-3151