On 06/25/99 12:50:17 you wrote:
Hi there,
We recently encountered a disk controller failure on one of our data servers (not on our Netapp). The problem was that this failure was not a complete failure of the drive itself, but rather the controller started to die slowly (i.e., it handled some requests and, unpredictably, not others). This caused our entire filesystem to become corrupt, and we lost some of our data as a result. Although this filesystem was mirrored, the mirror did not help us at all, because the corruption was considered a logical error in the filesystem and not a hardware problem.
My question is: should this occur on a Netapp (this may even apply to any other enterprise server), would it cause the entire filesystem to go corrupt and cause partial or complete data loss as in this case?
I would have to say yes, it's *possible*, but the controller would have to fail in a very odd way; not simply failing to respond to some requests (that would be caught), but executing some and not others (while claiming it had), or misordering commands, or something like that. I would think this kind of error would be very rare. Perhaps the controller you had is particularly prone to those sorts of errors; one advantage of Netapp is that you're using controllers they themselves have partially designed and tested.
Bruce
sirbruce@ix.netcom.com wrote:
someone said: ...We recently encountered a disk controller failure on one of our data servers (not on our Netapp). The problem was that this failure was not a complete failure of the drive itself, but rather the controller started to die slowly. ... ...should this occur on a Netapp ... would it cause the entire filesystem to go corrupt and cause partial or complete data loss as in this case? ...
I would have to say yes, it's *possible*, but the controller would have to fail in a very odd way; ...
I'll give you odd. Two days after it was put into service, our F740 crashed with a "WAFL hung" message. Later that morning, it threw a disk and started to reconstruct it on another disk. Whenever the reconstruction reached exactly 51%, the system would freeze and crash, again due to "WAFL hung". This happened repeatedly, despite our and NetApp's best efforts. After about 10 hours of this, NetApp said the cores we were sending them were complaining about too many problems with too many disks for it to be a real disk problem, and they decided the fault was not in the disks but in the FC controller. We swapped the system board and held our breath as the reconstruction reached - and passed - 51%. After a total of 12 hours down, we were back in business, and the filer has been up ever since (10.5 days as I write this). We've been very happy with the filer since the incident, but I would say that failure qualifies as "odd", wouldn't you?
-ste