We have an F330 running 4.3.3D2 where one of the disks is now generating 30 - 40 "READ sector xxxxxxx recovered error" messages per day. At what frequency of such messages should I start worrying about the disk?
Steinar Haug, Nethelp consulting, sthaug@nethelp.no
Sthaug, We have had six F330s here at Texas Instruments France for three years. The question you are asking is one that every NetApp sysadmin asks sooner or later. There is generally no clear answer from NetApp support: I was told that up to 40 recoverable errors per day is still acceptable. To me, it is more than likely that a disk with 40 recoverable errors per day will fail soon (within a month). I would never take the risk of letting a disk die slowly, because in the meantime another disk can suddenly crash. That triggers a rebuild of the filesystem onto the spare disk, AND ACCELERATES THE DEATH OF THE FIRST DISK. IF THE FIRST DISK DIES FOR GOOD DURING THE REBUILD, YOU ARE IN BIG TROUBLE. THIS HAPPENED AT TIF, AND WE LOST 50 GB OF DATA.
WE CONSIDER 10 UNRECOVERABLE ERRORS PER DAY TO BE THE LIMIT. ABOVE 10, I STRONGLY RECOMMEND FAILING THE DISK MANUALLY.
On Sat, 16 Jan 1999 sthaug@nethelp.no wrote:
Perhaps I am overestimating the severity of this problem, but I start getting nervous if I see *one* warning of that nature per *week*. That is, if I see a single warning in one of the weekly syslog e-mails, I'll start watching that filer more closely. Netapp has been very good about replacing drives and RAM that ONTAP complains about. I figure I don't use up their resources calling their tech support, so I'll take advantage of their hardware warranty. ;-)
All of our production filers are used very heavily, and I don't want to let a drive degrade to the point where ONTAP reports more than one warning a day. Maybe I'm paranoid about failure, but then that's why I have Netapps in the first place. :)
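For anyone who wants to check their own logs against the per-day limits mentioned in this thread, here is a minimal Python sketch that counts "recovered error" messages per day. It assumes each log line starts with a classic syslog timestamp ("Jan 16 03:12:45 ...") and contains the phrase "recovered error" from the original question; the actual ONTAP message layout may differ, and the function names are purely illustrative.

```python
import re
from collections import Counter

# Assumption: lines begin with a syslog-style timestamp and contain the
# phrase "recovered error". Adjust the pattern to match your real logs.
LINE_RE = re.compile(
    r"^(?P<day>[A-Z][a-z]{2}\s+\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+.*recovered error"
)


def recovered_errors_per_day(lines):
    """Count matching messages per calendar day (keyed like 'Jan 16')."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # Collapse padding such as 'Jan  6' into 'Jan 6'.
            day = " ".join(match.group("day").split())
            counts[day] += 1
    return counts


def days_over_limit(counts, limit):
    """Return the days, in log order, whose count exceeds the limit."""
    return [day for day, n in counts.items() if n > limit]
```

The limit is left as a parameter because the posters disagree: one reply above draws the line at 10 per day, another at more than one per day. Something like `days_over_limit(recovered_errors_per_day(open("messages")), 10)` would list the days that cross the chosen limit, where `messages` is a placeholder for your log file.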