Hello,
I don't see a reference to the version of ONTAP in use, but perhaps I can relate some information.
As of the 5.3.2 release we added functionality to reassign bad blocks as they occur. Prior to that we relied on the automatic reassignment features of the disk drives themselves. We found that the disks did not handle reassignment in all the cases we'd like, so we took control of that function. That would be why you could see multiple reports of bad blocks showing up in subsequent scrubs: the disk did not do the reassignment, so the bad spot was left on the media.
As of 5.3.2 messages would appear to the effect of: Sun Apr 30 04:38:50 MDT [isp2100_main]: Disk 5.14: sector 33601609 will be reassigned
Reassignment means the device uses a different piece of media to store the data for a given block address. Not all errors returned from a disk can be handled by block reassignment; really, only those that come back as unrecoverable media errors can be repaired by reassigning the block.
Should the reassignment fail for some reason, the disk is failed, since sector-level errors of that kind can lead to larger reliability issues.
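In rough pseudocode, the post-5.3.2 handling looks something like the sketch below. This is only an illustration (written in Python for readability), not our actual code, and none of the names correspond to real ONTAP internals.

# Simplified sketch of the post-5.3.2 bad-block handling described above.
# None of these names correspond to real ONTAP internals; illustrative only.

UNRECOVERABLE_MEDIA_ERROR = "unrecoverable media error"

class RaidGroup:
    def reconstruct_block(self, bad_disk, sector):
        # Placeholder: a real RAID group would XOR the corresponding block
        # from every surviving data disk together with parity.
        return b"\x00" * 512

class Disk:
    def __init__(self, disk_id):
        self.id = disk_id
        self.failed = False

    def reassign_sector(self, sector):
        # Stands in for the drive-level reassign operation; pretend it succeeds.
        return True

    def write(self, sector, data):
        pass  # placeholder for the actual media write

def handle_media_error(raid_group, disk, sector, error):
    """Repair a bad sector found during a read or a scrub."""
    # Only unrecoverable media errors are repairable this way; other
    # error types (transport problems, timeouts, ...) take other paths.
    if error != UNRECOVERABLE_MEDIA_ERROR:
        return

    # Rebuild the lost block from the surviving disks plus parity.
    data = raid_group.reconstruct_block(disk, sector)

    # Have the filer, not the drive firmware, remap the sector to spare
    # media, then re-write the reconstructed data there.
    if disk.reassign_sector(sector):
        disk.write(sector, data)
        print(f"Disk {disk.id}: sector {sector} will be reassigned")
    else:
        # If even the reassignment fails, the media is going bad in a way
        # we can't work around, so fail the whole drive.
        disk.failed = True

handle_media_error(RaidGroup(), Disk("5.14"), 33601609, UNRECOVERABLE_MEDIA_ERROR)

The important point is the last branch: once reassignment itself fails, we no longer trust the media and fail the drive.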
Tony -------------------------- Tony Aiello, Mgr. Storage Software mailto:taiello@netapp.com Ph:(408)822-6515
-----Original Message----- From: Robert L. Millner [mailto:rmillner@transmeta.com] Sent: Wednesday, May 03, 2000 10:08 AM To: toasters Subject: Re: raid failure
Hey,
GDG> autosupport messages too. As I went back through the autosupport
GDG> logs that are e-mailed to me each week, I found that the problem
GDG> began approximately two weeks earlier. Every time the disk tried
GDG> to read a particular sector, an error message would appear in the
GDG> messages log indicating such an event had occurred. Had I not been
GDG> busily working other issues, to the detriment of my filers, I
GDG> would have failed this disk at least a week prior.
My immediate question to Netapp in this case would be: why was the periodic disk scrubbing not sufficient to cause the failed sectors to be replaced (this was going on for two weeks)? Why, upon detection of the block failure (after all, if a log message is generated, then the filer knows it happened), was the data not immediately reconstructed elsewhere and the disk blocks marked as unusable? A block-sized RAID reconstruction and re-write should be a trivial problem for the filer to solve. This is the kind of detail I'd expect a storage vendor to place a much higher priority on than having a Java GUI. This is a well-known way that disks fail, not some mysterious voodoo issue. I worry about what other well-known failure modes were left out until a later release of ONTAP.
GDG> occurred. Had I not been busily working other issues, to the
GDG> detriment of my filers, I would have failed this disk at least
GDG> a week prior.
Had Netapp not been busily working other issues, to the detriment of your and your users' time and data, this disk would have failed itself or the filer would have taken some other corrective action on its own. You should have your own automated methods for spotting problems (like a script that analyzes the logs and reports problems back to you). Don't be afraid to turn into a nasty bastard in a situation like this. None of my users would hesitate for a moment, and that may be your last recourse for making sure that people understand the priority of certain kinds of issues.
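Even something as crude as the sketch below would have flagged this within a day. It's a hypothetical example: the log path, the patterns, and the mail address are placeholders you'd have to adjust for however you collect your filers' syslog and autosupport output.

#!/usr/bin/env python3
# Hypothetical log-watching script of the kind mentioned above.
# Log path, patterns, and address are examples only; adjust for your site.

import re
import smtplib
from email.message import EmailMessage

LOG_FILE = "/var/log/filer/messages"      # wherever you syslog your filers to
PATTERNS = [
    re.compile(r"unrecoverable.*media error", re.IGNORECASE),
    re.compile(r"sector \d+ will be reassigned", re.IGNORECASE),
    re.compile(r"disk fail", re.IGNORECASE),
]
ADMIN = "storage-admins@example.com"

def scan(path):
    """Return every log line that matches one of the trouble patterns."""
    hits = []
    with open(path) as fh:
        for line in fh:
            if any(p.search(line) for p in PATTERNS):
                hits.append(line.rstrip())
    return hits

def report(lines):
    """Mail the suspicious lines to the admin address."""
    msg = EmailMessage()
    msg["Subject"] = f"filer log watch: {len(lines)} suspicious lines"
    msg["From"] = ADMIN
    msg["To"] = ADMIN
    msg.set_content("\n".join(lines))
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)

if __name__ == "__main__":
    suspicious = scan(LOG_FILE)
    if suspicious:
        report(suspicious)

Run it out of cron every hour and you'll never again be two weeks behind the messages file.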
I realize that I am being brutal to Netapp here, but that kind of failure would cost us, in time to rebuild the data, more than twice what we have invested in our entire Netapp infrastructure. It gives me that cold, prickly, paranoid feeling about all the data we have on our filers. I also realize that there are other potential problems that could have caused a dual disk failure in one RAID group. Still, this specific problem should have been dealt with more gracefully by the filer on its own; if the filer couldn't do that, then your case alone should have been enough to put it on the 'Must Fix This Immediately!' list.
Rob
"You're just the little bundle of negative reinforcement I've been looking for." -Mr. Gone
----- Original Message ----- From: "Aiello, Tony" Tony.Aiello@netapp.com To: "'Robert L. Millner'" rmillner@transmeta.com; "toasters" toasters@mathworks.com Sent: Wednesday, May 03, 2000 11:31 AM Subject: RE: raid failure
Hello,
I don't see a reference to the version of ONTAP in use, but perhaps I can relate some information.
GD didn't say exactly what the error message from the prior problem was, either.
Possibly what happened was a drive error that WAS recoverable. Netapp will log when it has trouble talking to a drive but won't fail it so long as the operation eventually succeeds. It would be wrong to fail a drive simply because it temporarily took too long to respond.
Also, I believe that in the past, if there was a read error, the block would not be reassigned; instead it would be rewritten in place using the parity information. However, in rare cases you could have a "weak" block where writes appeared to succeed at first but subsequent reads would eventually fail.
In any case, I don't think it is necessarily Netapp's fault for not failing the drive. Transient disk errors can occur, and you can only program so many heuristics into the Netapp OS. It is entirely possible for such an event to happen without the customer having another disk failure, or for the problem never to resurface during reconstruction. But a previous poster asked what they could do to minimize the prospects even further... to do that, you fail the drive as soon as anything looks like it might be wrong with it. The result, of course, is you spend more money on drives.
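To make that trade-off concrete, here's a rough sketch of the kind of heuristic we're talking about. The thresholds and the time window are made-up numbers for illustration, not anything Netapp actually uses.

# Hypothetical sketch of the trade-off described above: how aggressively to
# fail a drive on transient errors. Numbers are made up for illustration.

import time
from collections import defaultdict, deque

ERROR_WINDOW = 7 * 24 * 3600   # consider errors seen over the last week
AGGRESSIVE_THRESHOLD = 1       # fail on the first sign of trouble (costs drives)
CONSERVATIVE_THRESHOLD = 10    # tolerate transient errors (risks double failures)

class DriveErrorTracker:
    def __init__(self, threshold):
        self.threshold = threshold
        self.errors = defaultdict(deque)   # drive id -> timestamps of errors

    def record_error(self, drive_id, now=None):
        """Record an error and return True if the drive should be failed."""
        now = now if now is not None else time.time()
        window = self.errors[drive_id]
        window.append(now)
        # Drop errors that have aged out of the window.
        while window and now - window[0] > ERROR_WINDOW:
            window.popleft()
        return len(window) >= self.threshold

# An aggressive policy fails disk 5.14 on its first error; a conservative
# one waits to see whether the errors keep recurring.
aggressive = DriveErrorTracker(AGGRESSIVE_THRESHOLD)
print(aggressive.record_error("5.14"))     # True -> fail the drive immediately

Set the threshold to 1 and you catch cases like this one early, at the price of throwing away drives that only hiccuped once.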
Bruce