Re: raid failure

3 May 2000


      "Robert L. Millner" wrote:
...
Hey,
GDG> autosupport messages too.  As I went back through the autosupport
GDG> logs that are e-mailed to me each week, I found that the problem
GDG> began approximately two weeks earlier.  Every time the filer tried
GDG> to read a particular sector of the disk, an error message would
GDG> appear in the messages log indicating such an event had occurred.  Had
GDG> I not been busily working other issues, to the detriment of my
GDG> filers, I would have failed this disk at least a week prior.
My immediate question to Netapp in this case would be why was the
periodic disk scrubbing not sufficient to cause the failed sectors to be
replaced (this was going on for two weeks)?  Why upon detection of the
block failure (after all, if a log message is generated, then the filer
knows it happened) was the data not immediately reconstructed elsewhere
and the disk blocks marked as unusable?  A block-sized RAID
reconstruction and re-write should be a trivial problem for the filer to
solve.  This is the kind of detail I'd expect a storage vendor to place
a much higher priority on than having a Java GUI.  This is a well-known
way that disks fail; not some mysterious voodoo issue.  I worry about
what other well-known failure modes were left out till a later release
of ONTAP.
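
For illustration only, here is a minimal sketch of what a block-sized
reconstruction looks like in a single-parity (XOR) RAID stripe.  It is not
ONTAP's actual implementation; the function names and the sample stripe
below are invented for the example, and it assumes the parity block is the
XOR of all data blocks in the stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_block(surviving_data_blocks, parity_block):
    """Rebuild the one missing data block from the survivors plus parity.

    Assumes single-parity RAID: parity = XOR of every data block.
    """
    return xor_blocks(list(surviving_data_blocks) + [parity_block])


# Hypothetical 3-disk stripe plus parity (values are illustrative only).
d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
d2 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([d0, d1, d2])

# Pretend the block on the second disk is unreadable and rebuild it.
rebuilt_d1 = reconstruct_block([d0, d2], parity)
```

The rebuilt block could then be written to a spare location and the bad
sector marked unusable, which is the behaviour being asked for above.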
First and foremost, I take complete responsibility for my filers.  I did so
in my message to my management, which was forwarded up through VP level.
Having said that, I do agree with you.  This disk should have been failed by
the filer two weeks prior.  We at TI are now pushing NetApp to be proactive in
producing a stable product.  I have administered NAFS 1300/1400, F300s,
F500s, and F700s.  With each new rendition of the hardware, I have found the
reliability diminished.  At the Customer Advisory Council I advised them to
make a more stable system.  Recently representatives from Texas Instruments,
Inc. again told NetApp to provide a more stable system.  I, personally, am
finding it harder to defend NetApp.  Overall they have good service after the
sale -- if you live in North America.
I still prefer to look at the autosupport messages.  There is so much that I
glean from these.  The information covers not just which disks to watch out
for but also whether the network interfaces are being overrun.  I have begun
writing a rather large complement of tools in Perl to make the job easier as
the number of filers continues to grow.  We are also looking at what other
storage vendors can do for us.  This is not just an issue with failures;
the NetApp filers are no longer able to supply the power needed at peak
times in our process.
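
As a rough sketch of the kind of check such a tool might do (written in
Python here rather than Perl), the following counts repeated read errors
per disk in a batch of log lines and lists the disks that cross a
threshold.  The log line format, the disk names, and the function names
are all made up for the example; real filer messages would need their own
pattern.

```python
import re
from collections import Counter

# Hypothetical line format, e.g. "disk 8a.3: read error on sector 12345".
# This is an assumption for illustration, not the real messages-log format.
ERROR_PATTERN = re.compile(r"disk (?P<disk>\S+): read error")


def count_read_errors(lines):
    """Return a Counter mapping disk name -> number of read-error lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group("disk")] += 1
    return counts


def disks_to_watch(lines, threshold=3):
    """List disks whose read-error count is at or above the threshold."""
    counts = count_read_errors(lines)
    return sorted(disk for disk, n in counts.items() if n >= threshold)


# Invented sample data in the assumed format.
sample_log = [
    "disk 8a.3: read error on sector 12345",
    "nfs: client mounted /vol/vol0",
    "disk 8a.3: read error on sector 12345",
    "disk 8b.1: read error on sector 777",
]

suspects = disks_to_watch(sample_log, threshold=2)
```

Run weekly against each filer's log, a check like this would flag a disk
that keeps failing on the same sector well before anyone has time to read
the messages by hand.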
I cannot disclose what fifteen hours of down time cost Texas Instruments,
Inc., as that is proprietary information, though we were fortunate that this
was over a holiday weekend.
-gdg
--
---------------------------------------------------------------
G D Geen                        mailto:geen@ti.com
Texas Instruments               Phone : (214)480.7896
System Administrator            FAX   : (214)480.7676
---------------------------------------------------------------
Life is what happens while you're busy making other plans.
                                              -J. Lennon
