We do publish a calculation that you can use to estimate the "Mean Time to Data Loss" (MTTDL) based on the MTBF for each type of drive. You can get the MTBF number from the individual drive manufacturers' web sites.
http://www.netapp.com/tech_library/3027.html#section3.2
In this reference, "Data Loss" is defined as "when the failures of two or more disk drives within the same RAID group overlap". That is the point at which you would be required to restore from tape.
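For readers who want to try the arithmetic, here is a minimal sketch of the kind of estimate described above. It uses the classic textbook approximation for a single-parity RAID group, MTTDL = MTBF^2 / (N * (N - 1) * MTTR), which may not be the exact formula in the referenced report. The function name and all input values (MTBF, group size, rebuild time) are hypothetical and chosen only for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_hours(mtbf_hours, disks_in_group, mttr_hours):
    """Estimate mean time to data loss for one single-parity RAID group.

    Textbook approximation (an assumption here, not necessarily the
    vendor's formula): data is lost when a second disk in the same group
    fails before the first failed disk has been rebuilt.
    """
    n = disks_in_group
    return mtbf_hours ** 2 / (n * (n - 1) * mttr_hours)


# Hypothetical inputs: take the real MTBF from the drive manufacturer.
mtbf = 500_000   # hours, per drive
group_size = 14  # disks in the RAID group, including parity
rebuild = 24     # hours to reconstruct onto a spare

years = mttdl_hours(mtbf, group_size, rebuild) / HOURS_PER_YEAR
print(f"Estimated MTTDL: {years:,.0f} years per RAID group")
```

Note that the estimate scales with the square of the MTBF and inversely with the rebuild time, so a slow reconstruction or a larger RAID group shortens it quickly.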
As far as actual field numbers... I really don't know ;(
-----Original Message-----
From: Brian Tao [mailto:taob@risc.org]
Sent: Sunday, July 30, 2000 6:17 PM
To: toasters@mathworks.com
Subject: Netapp filesystem failure rate
Does Netapp have any numbers that they could share publicly on how often they see filesystem failures in the field (i.e., double disk failure, spare drive bug, etc.) that would require restoring all the data from tape? I did a quick count in my head and figured we have roughly 21 filer-years of operation (1 filer running for 3 years, 2 for 2.5 years, etc.) without a catastrophic failure yet. Is the actual observed number more like 50 filer-years? 100? 200?
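As a rough aside on what 21 failure-free filer-years can and cannot show, here is a small sketch. It assumes failures arrive as a Poisson process with a constant rate per filer-year, which is a simplifying assumption and not something stated in the thread; the function name is hypothetical.

```python
import math


def max_plausible_rate(exposure_years, confidence=0.95):
    """Upper bound on the failure rate given zero observed failures.

    Assumes a Poisson process: P(no failures) = exp(-rate * exposure).
    Solves exp(-rate * exposure) = 1 - confidence for rate.
    """
    return -math.log(1.0 - confidence) / exposure_years


exposure = 21.0  # filer-years without a catastrophic failure (from the post)
rate = max_plausible_rate(exposure)
print(f"95% upper bound: one failure per {1.0 / rate:.1f} filer-years")
```

Under that assumption, a clean record of this length only rules out rates worse than roughly one failure per 7 filer-years, so it cannot by itself distinguish between 50, 100, or 200 filer-years.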
As far as actual field numbers... I really don't know ;(
Actual field failure rates would no doubt be higher than the calculated ones, since you are dealing with a wider variety of drives over history (presumably some less reliable than the current ones) and you also have some rare software / firmware bugs that probably necessitated a restore once or twice.
Bruce
Brian Tao <taob@risc.org> asks:
Does Netapp have any numbers that they could share publicly on how often they see filesystem failures in the field (i.e., double disk failure, spare drive bug, etc.) that would require restoring all the data from tape?
[...]
Anissa.Mohler@netapp.com (Mohler Anissa) replies: [... theoretical background omitted ... ]
As far as actual field numbers... I really don't know ;(
I don't imagine that Network Appliance will want to talk about the actual frequency of filesystem failures publicly (and this mailing list is public). We all know that there is at least one commercial competitor watching who would have few scruples about spreading the word that "NetApp systems have an MTBFFF of only X years", while saying nothing about their own figures.
Chris Thompson
University of Cambridge Computing Service, New Museums Site, Cambridge CB2 3QG, United Kingdom.
Email: cet1@ucs.cam.ac.uk
Phone: +44 1223 334715
On Mon, 31 Jul 2000, Chris Thompson wrote:
I don't imagine that Network Appliance will want to talk about the actual frequency of filesystem failures publicly (and this mailing list is public).
No, but I figured I'd ask anyway. ;-) A few folks have replied privately with some of their numbers, which indicate that I'm heading in the right direction, at least. Given a stable of 15 filers and a duty cycle of 5 years (75 filer-years in total), a filesystem failure rate of at most once per 100 filer-years is my comfort level. I suspect that the actual observed failure rate is much lower than that, even if you consider Netapp's MTTDL calculations at http://www.netapp.com/tech_library/3027.html optimistic (12000 years worst case!). The downside of these numbers is that they make justifying an off-site mirror much more difficult. ;-)
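To put the comfort level above into numbers, here is a minimal sketch. It again assumes a constant-rate Poisson model, which is my simplification, and it treats the quoted 12000-year figure as if it applied per filer purely for illustration (the report may state it per RAID group). The function name is hypothetical.

```python
import math


def p_at_least_one_failure(filers, years, failures_per_filer_year):
    """Probability of one or more filesystem failures across the fleet.

    Assumes independent filers and a constant failure rate (Poisson).
    """
    expected = filers * years * failures_per_filer_year
    return 1.0 - math.exp(-expected)


filers, duty_cycle = 15, 5  # figures from the post: 75 filer-years

# Comfort level from the post: one failure per 100 filer-years.
print(f"{p_at_least_one_failure(filers, duty_cycle, 1 / 100):.1%}")

# Illustration only: the 12000-year worst case treated as a per-filer rate.
print(f"{p_at_least_one_failure(filers, duty_cycle, 1 / 12000):.2%}")
```

At the comfort-level rate, the chance of at least one restore from tape over the fleet's lifetime comes out slightly better than even, while at the calculated worst case it is well under one percent. That gap is what the off-site mirror argument depends on.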