sirbruce@ix.netcom.com wrote:
someone said: ...We recently encountered a disk controller failure on one of our data servers (not on our Netapp). The problem was that this failure was not a complete failure of the drive itself, but rather the controller started to die slowly. ... ...should this occur on a Netapp ... would it cause the entire filesystem to go corrupt and cause partial or complete data loss as in this case? ...
I would have to say yes, it's *possible*, but the controller would have to fail in a very odd way; ...
I'll give you odd. Two days after it was put into service, our F740 crashed with a "WAFL hung" message. Later that morning, it threw a disk and started to reconstruct it on another disk. Whenever the reconstruction reached exactly 51%, the system would freeze and crash, again due to "WAFL hung". This happened repeatedly, despite our and NetApp's best efforts. After about 10 hours of this, NetApp said the cores we were sending them showed too many problems with too many disks for it to be a real disk problem, and they decided it wasn't the disks but the FC controller. We swapped the system board and held our breath as it reached - and passed - 51%. After a total of 12 hours down, we were back in business, and the filer has been up ever since (10.5 days as I write this). We've been very happy with the filer since the incident, but I would say that failure qualifies as "odd", wouldn't you?
-ste