"AT" == Aiello, Tony Tony.Aiello@netapp.com:
"GDG" == G D Geen geen@msp.sc.ti.com:
AT> As of the 5.3.2 release we added functionality to reassign blocks as they occur. Prior to that we enabled automatic reassignment features on the disk drives to do this. We found that the disks did not handle the reassignment in all cases we'd like so we took control of that function. That would be
The fact that whatever version this customer was running was not able to handle that common type of disk failure scares the hell out of me for what 5.3.5R2P1 will not be able to handle and how it will bite me when something breaks. Why didn't this make it into ONTAP before something as useless as the Java GUI (OK, that's a personal opinion, but try to convince me that it's worth more than the ability to appropriately handle disk failures)? Netapp had years of experience from other vendors to draw from when it was formed. Part of that experience should have been comprehensive knowledge of the kinds of failures people see in the field.
GDG> First and foremost, I take complete responsibility for my filers. I did so in my message to my management which was forwarded up through VP level.
Of course, and there's always more that can be done at everyone's end. There's a long list of things I'd like to both do and see Netapp do.
Important problems that vendors missed in their own tests make me really worry about that vendor's Q/A and what a product will cost me later when something pathological happens (after all, that's part of the cost of running the box). Nobody gets everything right immediately and that includes Netapp. Nobody's server is perfect for my environment. Everything is a balance of costs and some of those costs are how well it works on its own. What they miss has a lot to do with what I consider my mistake to have been. Was my mistake not buying an EMC or a Sun in the first place, or was my mistake not noticing the autosupport message (for example)?
GDG> Having said that, I do agree with you. This disk should have failed by the filer two weeks prior. We at TI are now pushing NetApp to be proactive in
We have been discussing certain things we'd like to see in their product as well (in case anyone is curious, I can elaborate more here).
GDG> I still prefer to look at the autosupport messages. There is so much that I glean from these. The information is not just what disks to watch out for
Everything in the autosupport messages can be gleaned from commands you run on the filer and checking the logs. I'd be really interested in hearing what kinds of tools you are working on if you don't mind sharing that info.
GDG> I cannot disclose what fifteen hours of down time cost Texas Instruments, Inc. as that is proprietary information though we were fortunate that this was over a holiday weekend.
*shiver*
Rob
----- Original Message -----
From: "Robert L. Millner" rmillner@transmeta.com
To: "toasters" toasters@mathworks.com
Sent: Wednesday, May 03, 2000 2:21 PM
Subject: Re: raid failure
"AT" == Aiello, Tony Tony.Aiello@netapp.com:
"GDG" == G D Geen geen@msp.sc.ti.com:
AT> As of the 5.3.2 release we added functionality to reassign blocks as they occur. Prior to that we enabled automatic reassignment features on the disk drives to do this. We found that the disks did not handle the reassignment in all cases we'd like so we took control of that function. That would be
The fact that whatever version this customer was running was not able to handle that common type of disk failure scares the hell out of me for what 5.3.5R2P1 will not be able to handle and how it will bite me when something breaks. Why didn't this make it into ONTAP before something as useless as the Java GUI (OK, that's a personal opinion, but try to convince me that it's worth more than the ability to appropriately handle disk failures)? Netapp had years of experience from other vendors to draw from when it was formed. Part of that experience should have been comprehensive knowledge of the kinds of failures people see in the field.
I think this is a little unfair, unless I read Tony's message wrong. Consider the zillions of hours that have been logged on Netapp filers over the years: how many have run into this problem? A fraction of a %? Less? The disks are *supposed* to be able to handle reassignment, and have managed to do so many times for many years. What Netapp has done is decide that even the disks can screw this up sometimes, and they've added another layer of protection by doing it themselves. The chances of it "biting you" are extremely slim... it's not like every disk failure leads to corruption. Block failure reassignment is the common type of failure that Netapp handles fine. *Failure* of the block failure reassignment seems to be a rarer failure that Netapp is now addressing.
I mean, data can still be lost in a double disk failure. Why didn't a second parity drive feature make it into ONTAP before the Java GUI? Why not a third parity drive? Why not mirroring? Why not shielding from cosmic rays causing random bit errors?
Important problems that vendors missed in their own tests make me really worry about that vendor's Q/A and what a product will cost me later when something pathological happens (after all, that's part of the cost of running the box). Nobody gets everything right immediately and that includes Netapp. Nobody's server is perfect for my environment. Everything is a balance of costs and some of those costs are how well it works on its own. What they miss has a lot to do with what I consider my mistake to have been. Was my mistake not buying an EMC or a Sun in the first place, or was my mistake not noticing the autosupport message (for example)?
What makes you think EMC or Sun aren't even more prone to such failures? Or more prone to data corruption? What are their measured availability rates? Netapp is 99.99% including planned downtime.
Bruce