New subject: raid failure

3 May 2000


      "AT" == Aiello, Tony Tony.Aiello@netapp.com:
"GDG" == G D Geen geen@msp.sc.ti.com:
AT> As of the 5.3.2 release we added functionality to reassign blocks as they
AT> occur. Prior to that we enabled automatic reassignment features on the disk
AT> drives to do this. We found that the disks did not handle the reassignment
AT> in all cases we'd like so we took control of that function. That would be
The fact that whatever version this customer was running was not able to
handle that common type of disk failure scares the hell out of me about
what 5.3.5R2P1 will not be able to handle and how it will bite me when
something breaks.  Why didn't this make it into ONTAP before something
as useless as the Java GUI (OK, that's a personal opinion, but try to
convince me that it's worth more than the ability to appropriately handle
disk failures)?  NetApp had years of experience from other vendors to
draw from when it was formed.  Part of that experience should have been
comprehensive knowledge of the kinds of failures people see in the
field.
GDG> First and foremost, I take complete responsibility for my filers.  I did so
GDG> in my message to my management which was forwarded up through VP level.
Of course, and there's always more that can be done at everyone's end.
There's a long list of things I'd like to both do and see NetApp do.
Important problems that a vendor missed in its own tests make me really
worry about that vendor's QA and what a product will cost me later when
something pathological happens (after all, that's part of the cost of
running the box).  Nobody gets everything right immediately and that
includes NetApp.  Nobody's server is perfect for my environment.
Everything is a balance of costs and some of those costs are how well it
works on its own.  What they miss has a lot to do with what I consider my
mistake to have been.  Was my mistake not buying an EMC or a Sun in the
first place or was my mistake not noticing the autosupport message (for
example)?
GDG> Having said that, I do agree with you.  This disk should have failed by the
GDG> filer two weeks prior.  We at TI are now pushing NetApp to be proactive in
We have been discussing certain things we'd like to see in their product
as well (if anyone is curious, I can elaborate here).
GDG> I still prefer to look at the autosupport messages.  There is so much that I
GDG> glean from these.  The information is not just what disks to watch out for
Everything in the autosupport messages can be gleaned from commands you
run on the filer and from checking the logs.  I'd be really interested in
hearing what kinds of tools you are working on if you don't mind sharing
that info.
GDG> I cannot disclose what fifteen hours of down time cost Texas Instruments,
GDG> Inc. as that is proprietary information though we were fortunate that this
GDG> was over a holiday weekend.
*shiver*
Rob
