sirbruce@ix.netcom.com writes:
As to what happened, the most *likely* scenario for any significant downtime of the Netapp is double disk failure. That is, one disk failed, and during reconstruction they lost another.
I beg to differ. YahooMail has had significant problems with all sorts of unpleasant little bugs which have led to downtime for the system and, in two cases, actual data loss. Since the beginning of this year we have apparently run into every possible failure mode of these systems, except for a dual disk failure, and I can assure you that there are all sorts of reasons that an event like this could occur. Reading the tea leaves on this one, I would guess that they also ran into bug 1549 and were forced to run a wack on a filer (props to John E. for the fast wack; if it were not for that single improvement I can assure everyone at NetApp reading this that Yahoo would be using EMC boxes right now...)
If they were running a cluster, most likely there would have been no interruption in service. You get what you pay for. I'm sure their Apache (or whatever) web servers have crashed more than their filers have.
How I wish we would get what we paid for... We run a hacked version of Apache and have probably had a total of 30 minutes of system downtime due to Apache problems this year (mostly due to a memory leak that I think we introduced into our version). Contrast that with a total of 36 hours of system downtime since January due to filer failures. If NetApp expects us to pay twice as much per MB of storage and run a cluster, then they are going to start getting into the realm where an EMC box begins to look like a better deal, and when it comes to system reliability NetApp still has a long way to go to reach the sort of reputation that EMC has in the industry.
NetApp had a very serious problem with quality control on hardware, and QA within the engineering department was so useless as to be non-existent. We have been given assurances that this specific problem has been recognized and is being addressed. Only time will tell.
There doesn't seem to be much of a story here.
But there is a story here, and one which needs to be addressed. NetApp charges a large premium over the raw costs of the disks and hardware to hook them together compared to what I could rig up with a PC and a few RAID controllers. As a customer I would like to know that I am getting some value for this additional cost. When that "value-add" proves to be a mirage I think that I have a justifiable reason to be pissed off. Critical Path was just the first customer to admit that NetApp was the source of the downtime; the prompt attention we have received from NetApp (even when there was nothing they could do) is the only thing which prevented Yahoo from being the first one to attribute responsibility for system downtime to NetApp in a press release.
jim mccoy Yahoo! Inc.