Hello,
I don't see which version of ONTAP was in use, but perhaps I can
relate some information.
As of the 5.3.2 release we added functionality to reassign bad blocks as
they are detected. Prior to that we relied on the automatic reassignment
features of the disk drives to do this. We found that the disks did not
handle the reassignment in all the cases we would like, so we took control
of that function. That would be why you could see multiple reports of bad
blocks showing up in subsequent scrubs: the disk did not do the
reassignment, and so the bad spot was left on the media.
As of 5.3.2, messages appear to the effect of:
Sun Apr 30 04:38:50 MDT [isp2100_main]: Disk 5.14: sector 33601609 will be reassigned
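For anyone who wants to watch for these messages automatically, here is a minimal sketch in Python. The pattern is inferred only from the single sample line above, so other releases may word the message differently, and the function name find_reassignments is purely illustrative.

```python
import re

# Pattern inferred from the one sample message; treat it as an assumption,
# not a documented log format.
REASSIGN_RE = re.compile(
    r"\[(?P<source>[^\]]+)\]: Disk (?P<disk>\S+): "
    r"sector (?P<sector>\d+) will be reassigned"
)

def find_reassignments(lines):
    """Return (disk, sector) pairs for every reassignment message found."""
    hits = []
    for line in lines:
        m = REASSIGN_RE.search(line)
        if m:
            hits.append((m.group("disk"), int(m.group("sector"))))
    return hits

sample = [
    "Sun Apr 30 04:38:50 MDT [isp2100_main]: "
    "Disk 5.14: sector 33601609 will be reassigned",
]
print(find_reassignments(sample))
```

Feeding it the lines of a messages log and counting how often the same disk shows up would give an early warning of a drive that keeps growing defects.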
Reassignment means the device uses a different piece of media to store
the information for a given block address. Not all errors returned from a
disk can be handled by a block reassignment. Really, only those that come
back as unrecoverable media errors can be repaired by performing a
block-level reassignment.
Should the reassignment fail for some reason, the disk is failed, since
sector-level errors can lead to serious reliability problems.
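The decision described above can be sketched roughly as follows. This is only an illustration of the policy as stated in this message, not the actual ONTAP code; the names DiskError, Action, handle_error and the reassign callable are all hypothetical.

```python
from enum import Enum, auto

class DiskError(Enum):
    UNRECOVERABLE_MEDIA = auto()  # the only kind reassignment can repair
    OTHER = auto()                # hypothetical catch-all for everything else

class Action(Enum):
    REASSIGNED = auto()
    DISK_FAILED = auto()
    NOT_REPAIRABLE_BY_REASSIGN = auto()

def handle_error(error, reassign):
    """Decide what to do with one reported disk error.

    `reassign` is a hypothetical callable that asks the drive to remap
    the sector and returns True on success.
    """
    if error is not DiskError.UNRECOVERABLE_MEDIA:
        # Reassignment does not apply; other handling is out of scope here.
        return Action.NOT_REPAIRABLE_BY_REASSIGN
    if reassign():
        return Action.REASSIGNED
    # A failed reassignment leaves a known-bad sector, so fail the disk.
    return Action.DISK_FAILED

print(handle_error(DiskError.UNRECOVERABLE_MEDIA, lambda: False))
```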
Tony
--------------------------
Tony Aiello, Mgr. Storage Software
mailto:taiello@netapp.com
Ph:(408)822-6515
> -----Original Message-----
> From: Robert L. Millner [mailto:rmillner@transmeta.com]
> Sent: Wednesday, May 03, 2000 10:08 AM
> To: toasters
> Subject: Re: raid failure
>
>
> Hey,
>
> GDG> autosupport messages too. As I went back through the autosupport
> GDG> logs that are e-mailed to me each week, I found that the problem
> GDG> began approximately two weeks earlier. Every time a the disk tried
> GDG> to read a particular sector of the disk, an error messages would
> GDG> appear the messages log indicating such an event had occurred. Had
> GDG> I not been busily working other issues, to the detriment of my
> GDG> filers, I would have failed this disk at lease a week prior.
>
>
> My immediate question to Netapp in this case would be why was the
> periodic disk scrubbing not sufficient to cause the failed sectors to be
> replaced (this was going on for two weeks)? Why upon detection of the
> block failure (after all, if a log message is generated, then the filer
> knows it happened) was the data not immediately reconstructed elsewhere
> and the disk blocks marked as unusable? A block sized RAID
> reconstruction and re-write should be a trivial problem for the filer to
> solve. This is the kind of detail I'd expect a storage vendor to place
> a much higher priority on than having a java GUI. This is a well known
> way that disks fail; not some mysterious voodoo issue. I worry about
> what other well known failure modes were left out till a later release
> of ONTAP.
>
>
> GDG> occurred. Had I not been busily working other issues, to the
> GDG> detriment of my filers, I would have failed this disk at lease
> GDG> a week prior.
>
> Had Netapp not been busily working other issues, to the detriment of you
> and your user's time and data this disk would have failed itself or the
> filer would have taken some other corrective action on its own. You
> should have your own automated methods for looking for problems (like a
> script which analyses the logs and reports problems back to you). Don't
> be afraid to turn into a nasty bastard in a situation like this. None
> of my users would hesitate for a moment and that may be your last
> recourse to making sure that people understand the priority of certain
> kinds of issues.
>
>
> I realize that I am being brutal to Netapp here but that kind of failure
> would cost us more than twice what we have invested in our entire Netapp
> infrastructure in time to rebuild the data. It gives me that cold,
> prickly, paranoid feeling about all the data we have on our filers. I
> also realize that there are other potential problems that would have
> caused a dual disk failure in one raid group. This specific problem
> should have been dealt with more gracefully by the filer on its own. If
> it didn't, then your case alone should have been enough to put it on the
> 'Must Fix This Immediately!' list.
>
>
>
> Rob
>
> "You're just the little bundle of negative reinforcement I've
> been looking for." -Mr. Gone
>