New subject: raid failure

3 May 2000


      Hello,
I don't see the reference to the version of OnTap used but perhaps I can
relate some information.
As of the 5.3.2 release we added functionality to reassign blocks as they
occur. Prior to that we enabled automatic reassignment features on the disk
drives to do this. We found that the disks did not handle the reassignment
in all cases we'd like so we took control of that function. That would be
why you could see multiple reports of bad blocks showing up in subsequent
scrubs. The disk did not do the reassignment and so this bad spot was left
on the media.
As of 5.3.2 messages would appear to the effect of:
Sun Apr 30 04:38:50 MDT [isp2100_main]: Disk 5.14: sector 33601609 will be
reassigned
Reassignment means the device uses a different piece of media to store
information for some block address. Not all errors returned from a disk can
be handled by a block reassignment - really only those that come back as
unrecoverable media errors can you repair by performing a block level
reassignment.
Should the reassigment fail for some reason then the disk is failed as
sector-wise errors can lead to large reliability issues.
Tony 
--------------------------
Tony Aiello, Mgr. Storage Software
mailto:taiello@netapp.com
Ph:(408)822-6515
...
-----Original Message-----
From: Robert L. Millner [mailto:rmillner@transmeta.com]
Sent: Wednesday, May 03, 2000 10:08 AM
To: toasters
Subject: Re: raid failure
Hey,
GDG> autosupport messages too.  As I went back through the autosupport
GDG> logs that are e-mailed to me each week, I found that the problem
GDG> began approximately two weeks earlier.  Every time a the 
disk tried
GDG> to read a particular sector of the disk, an error messages would
GDG> appear the messages log indicating such an event had 
occurred.  Had
GDG> I not been busily working other issues, to the detriment of my 
GDG> filers, I would have failed this disk at lease a week prior.
My immediate question to Netapp in this case would be why was the
periodic disk scrubbing not sufficient to cause the failed 
sectors to be
replaced (this was going on for two weeks)?  Why upon detection of the
block failure (after all, if a log message is generated, then 
the filer
knows it happened) was the data not immediately reconstructed 
elsewhere
and the disk blocks marked as unusable?  A block sized RAID
reconstruction and re-write should be a trivial problem for 
the filer to
solve.  This is the kind of detail I'd expect a storage 
vendor to place
a much higher priority on than having a java GUI.  This is a 
well known
way that disks fail; not some mysterious voodoo issue.  I worry about
what other well known failure modes were left out till a later release
of ONTAP.
GDG> occurred.  Had I not been busily working other issues, to the
GDG> detriment of my filers, I would have failed this disk at lease 
GDG> a week prior.
Had Netapp not been busily working other issues, to the 
detriment of you
and your user's time and data this disk would have failed 
itself or the
filer would have taken some other corrective action on its own.  You
should have your own automated methods for looking for 
problems (like a
script which analyses the logs and reports problems back to 
you).  Don't
be afraid to turn into a nasty bastard in a situation like this.  None
of my users would hesitate for a moment and that may be your last
recourse to making sure that people understand the priority of certain
kinds of issues.
I realize that I am being brutal to Netapp here but that kind 
of failure
would cost us more than twice what we have invested in our 
entire Netapp
infrastructure in time to rebuild the data.  It gives me that cold,
prickly, paranoid feeling about all the data we have on our filers.  I
also realize that there are other potential problems that would have
caused a dual disk failure in one raid group.  This specific problem
should have been dealt with more gracefully by the filer on 
its own.  If
it didn't, then your case alone should have been enough to 
put it on the
'Must Fix This Immediately!' list.
Rob
"You're just the little bundle of negative reinforcement I've
         been looking for."  -Mr. Gone

RE: raid failure