"AT" == Aiello, Tony <Tony.Aiello@netapp.com>:
"GDG" == G D Geen <geen@msp.sc.ti.com>:
AT> As of the 5.3.2 release we added functionality to reassign blocks as they
AT> occur. Prior to that we enabled automatic reassignment features on the disk
AT> drives to do this. We found that the disks did not handle the reassignment
AT> in all cases we'd like so we took control of that function. That would be
The fact that whatever version this customer was running could not handle that common type of disk failure scares the hell out of me: what will 5.3.5R2P1 not be able to handle, and how will it bite me when something breaks? Why didn't this make it into ONTAP before something as useless as the Java GUI (OK, that's a personal opinion, but try to convince me that it's worth more than the ability to handle disk failures appropriately)? NetApp had years of experience from other vendors to draw on when it was formed. Part of that experience should have been comprehensive knowledge of the kinds of failures people see in the field.
GDG> First and foremost, I take complete responsibility for my filers. I did so
GDG> in my message to my management which was forwarded up through VP level.
Of course, and there's always more that can be done at everyone's end. There's a long list of things I'd like both to do myself and to see NetApp do.
Important problems that a vendor missed in its own tests make me really worry about that vendor's QA and what the product will cost me later when something pathological happens (after all, that's part of the cost of running the box). Nobody gets everything right immediately, and that includes NetApp. Nobody's server is perfect for my environment. Everything is a balance of costs, and one of those costs is how well the box works on its own. What they miss has a lot to do with what I consider my mistake to have been. Was my mistake not buying an EMC or a Sun in the first place, or was it not noticing the autosupport message (for example)?
GDG> Having said that, I do agree with you. This disk should have failed by the
GDG> filer two weeks prior. We at TI are now pushing NetApp to be proactive in
We have been discussing certain things we'd like to see in their product as well (in case anyone is curious, I can elaborate here).
GDG> I still prefer to look at the autosupport messages. There is so much that I
GDG> glean from these. The information is not just what disks to watch out for
Everything in the autosupport messages can be gleaned by running commands on the filer and checking the logs. I'd be really interested in hearing what kinds of tools you are working on, if you don't mind sharing that info.
GDG> I cannot disclose what fifteen hours of down time cost Texas Instruments,
GDG> Inc. as that is proprietary information though we were fortunate that this
GDG> was over a holiday weekend.
*shiver*
Rob