sirbruce(a)ix.netcom.com writes:
>As to what happened, the most *likely* scenario for any signficant
>downtime of the Netapp is double disk failure. That is, one disk
>failed, and during reconstruction they lost another.
I beg to differ. YahooMail has had significant problems with all sorts of
unpleasant little bugs which have led to downtime for the system and in two
cases actual data loss. Since the beginning of this year we have apparently
run into every possible failure mode of these systems, except for a dual
disk failure, and I can assure you that there are all sorts of reasons that
an event like this could occur. Reading the tea leaves on this one I would
guess that they also ran into bug 1549 and were forced to run a wack on a
filer (props to John E. for the fast wack, if it were not for that single
improvement I can assure everyone at NetApp reading this that Yahoo would be
using EMC boxes right now...)
>If they were running a cluster, most likely there would have been
>no interruption in service. You get what you pay for. I'm sure
>their Apache (or whatever) web servers have crashed more than their
>filers have.
How I wish we would get what we paid for... We run a hacked version of
Apache and have probably had a total of 30 minutes of system downtime due to
Apache problems this year (mostly due to a memory leak that I think we
introduced into our version). Contrast that with a total of 36 hours of
system downtime since January due to filer failures. If NetApp expects us
to pay twice as much per MB of storage and run a cluster then they are going
to start getting into the realm where an EMC box begins to look like a
better deal, and when it comes to system reliability NetApp still has a long
way to go to reach the sort of reputation that EMC has in the industry.
NetApp had a very serious problem with quality control on hardware, and QA
within the engineering department was so useless as to be non-existent. We
have been given assurances that this specific problem has been recognized and
is being addressed. Only time will tell.
>There doesn't seem to be much of a story here.
But there is a story here, and one which needs to be addressed. NetApp
charges a large premium over the raw costs of the disks and hardware to hook
them together compared to what I could rig up with a PC and a few RAID
controllers. As a customer I would like to know that I am getting some value
for this additional cost. When that "value-add" proves to be a mirage I
think that I have a justifiable reason to be pissed off. Critical Path was
just the first customer to admit that NetApp was the source of the downtime;
the prompt attention we have received from NetApp (even when there was
nothing they could do) is the only thing which prevented Yahoo from being
the first one to attribute responsibility for system downtime to NetApp in a
press release.
jim mccoy
Yahoo! Inc.