New subject: raid failure

3 May 2000


      I have lived through this harrowing experience.  Yes there were warning signs but
as I was busy performing other tasks I did not see them.  The moral of my story is
to read your weekly message logs and have good back ups.  I also file every weekly
autosupport message. I have message logs from everyone of my filers, we currently
have twenty-one, dating back to 1996 when I started this position.
It began with a phone call in the middle of the night.  On the other end was a
co-worker confused as to why the filer was no longer serving data.  She had made
the determination that there was a failed disk, but little did she know what the
next fifteen hours would bring.  The reported disk failure was being reconstructed
but would never complete.  That is when I got the phone call.  By this time it was
just after 10:00 PM and I rushed into work.  I immediately got the run down on the
problem she was seeing and what actions were taken.  From that point I made a few
stabs at getting the RAID to reconstruct.  My attempts also failed, now we had to
call NetApp support.  I made the priority one call and was soon contacted by a
NetApp engineer.  After a couple of hours poking around on the filer, we were able
to determine the sequence of events leading up to the nights exercises in
futility.  Approximately two weeks prior, the filer had reported in the messages
logs that a disk in RAID group zero on volume zero, the root volume, had an
unrecoverable read error but did not fail the disk.  This night, now closer to
morning, the system failed a different disk in RAID group zero on volume zero.  See
the connection?  The net result was the filer could not reconstruct the data
because of the unrecoverable read error on the earlier disk.  The filer could not
make any sense of the data coming off the corrupted sector of the disk.  By this
time our local support people were awaken by the central support team and had
joined us.  We were starting to draw a crowd.  It was closing in on 2:00 AM when we
tried to reconstruct the disk with the bad read sector manually but after a couple
of failed attempts even the NetApp engineer was beginning to sound depressed.  Then
came the question.  The one that all data managers hate to hear.  How good are your
backups?  To this question I replied, "we are about to find out."
Since our root volume failed, the first step was to make a new root volume.  I made
/vol/vol1 our new root volume.  Next we destroyed /vol/vol0 and removed all the
troubled disk drives.  Since all of these disks are now spares, we may create a new
RAID group and a new /vol/vol0.  Here it is important to note that there must be a
/etc on the volume *before* Veritas can begin its data restore.  Once the volume
was recreated, we began our restore of approximately 150 Gigabytes of data which
took us well into the morning.
The NetApp engineer in California and I had a long conversation over the phone
while the data restores were beginning.  He was certain that there should have been
some sort of sign that this was about to happen.  I assured him that I would look
and would forward my results to our local support people, who are copied on the
autosupport messages too.  As I went back through the autosupport logs that are
e-mailed to me each week, I found that the problem began approximately two weeks
earlier.  Every time a the disk tried to read a particular sector of the disk, an
error messages would appear the messages log indicating such an event had
occurred.  Had I not been busily working other issues, to the detriment of my
filers, I would have failed this disk at lease a week prior.
-gdg
Mark D Fowle wrote:
...
I have heard a few horror stories lately about netapps and multi-disk raid
 failures.  Has anyone out there experienced this
and what did you do for recovery ?  Where there any warnings?  I have not had
 this happen and would like to do as much
as possible to prevent it.
Thanks,
=======
Mark Fowle
Caterpillar/BCP
Cary North Carolina
===============================================================================
=======
--
---------------------------------------------------------------
G D Geen                        mailto:geen@ti.com
Texas Instruments               Phone : (214)480.7896
System Administrator            FAX   : (214)480.7676
---------------------------------------------------------------
Life is what happens while you're busy making other plans.
                                              -J. Lennon

Re: raid failure

Thanks,