sirbruce@ix.netcom.com writes:
As to what happened, the most *likely* scenario for any significant downtime of the Netapp is double disk failure. That is, one disk failed, and during reconstruction they lost another.
I beg to differ. YahooMail has had significant problems with all sorts of unpleasant little bugs which have led to downtime for the system and in two cases actual data loss. Since the beginning of this year we have apparently run into every possible failure mode of these systems, except for a dual disk failure, and I can assure you that there are all sorts of reasons that an event like this could occur. Reading the tea leaves on this one I would guess that they also ran into bug 1549 and were forced to run a wack on a filer (props to John E. for the fast wack; if it were not for that single improvement I can assure everyone at NetApp reading this that Yahoo would be using EMC boxes right now...)
If they were running a cluster, most likely there would have been no interruption in service. You get what you pay for. I'm sure their Apache (or whatever) web servers have crashed more than their filers have.
How I wish we would get what we paid for... We run a hacked version of Apache and have probably had a total of 30 minutes of system downtime due to Apache problems this year (mostly due to a memory leak that I think we introduced into our version.) Contrast that with a total of 36 hours of system downtime since January due to filer failures. If NetApp expects us to pay twice as much per MB of storage and run a cluster then they are going to start getting into the realm where an EMC box begins to look like a better deal, and when it comes to system reliability NetApp still has a long way to go to reach the sort of reputation that EMC has in the industry.
NetApp had a very serious problem with quality control on hardware, and QA within the engineering department was so useless as to be non-existent. We have been given assurances that this specific problem has been recognized and is being addressed. Only time will tell.
There doesn't seem to be much of a story here.
But there is a story here, and one which needs to be addressed. NetApp charges a large premium over the raw costs of the disks and hardware to hook them together compared to what I could rig up with a PC and a few raid controllers. As a customer I would like to know that I am getting some value for this additional cost. When that "value-add" proves to be a mirage I think that I have a justifiable reason to be pissed off. Critical Path was just the first customer to admit that NetApp was the source of the downtime. The prompt attention we have received from NetApp (even when there was nothing they could do) is the only thing which prevented Yahoo from being the first one to attribute responsibility for system downtime to NetApp in a press release.
jim mccoy Yahoo! Inc.
In the immortal words of Jim McCoy (mccoy@yahoo-inc.com):
Reading the tea leaves on this one I would guess that they also ran into bug 1549 and were forced to run a wack on a filer (props to John E. for the fast wack; if it were not for that single improvement I can assure everyone at NetApp reading this that Yahoo would be using EMC boxes right now...)
Out of morbid curiosity, which is bug 1549? I have had to run wack in recent memory, and it was a...sobering experience.
-n
------------------------------------------------------memory@blank.org
I've got more than one membership / to more than one club
and I owe my life / to the people that I love. (--Ani DiFranco)
http://www.blank.org/memory/------------------------------------------
On Mon, 17 May 1999, Nathan J. Mehl wrote:
Out of morbid curiosity, which is bug 1549? I have had to run wack in recent memory, and it was a...sobering experience.
Bug ID: 1549 Product: Data ONTAP
Title: System crashes with the "write_alloc.c:xxx: Assertion failure" message.
Problem: When the filer removes a file, it tries to resize the filer's inode and free a block associated with the inode. However, when the filer is verifying whether the block to be freed is part of the active file system, if the block is already free, the filer crashes.
Workaround: Run the wack utility on the filer. For more information about wack, contact Network Appliance Technical Support.
Solution or Fix: Release: -
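For readers unfamiliar with this class of failure, here is a minimal Python sketch of the situation the bug report describes: a block is freed a second time and a strict consistency check aborts. It is a toy model only. The `BlockMap` class and its method names are hypothetical and say nothing about how Data ONTAP or write_alloc.c is actually implemented.

```python
class BlockMap:
    """Toy block allocation map (hypothetical; not Data ONTAP code)."""

    def __init__(self, nblocks):
        # in_use[n] is True while block n belongs to the active file system.
        self.in_use = [False] * nblocks

    def allocate(self):
        """Mark the first free block as in use and return its number."""
        for blkno, used in enumerate(self.in_use):
            if not used:
                self.in_use[blkno] = True
                return blkno
        raise RuntimeError("no free blocks")

    def free_strict(self, blkno):
        """Free a block, aborting if it is already free.

        This mirrors the reported symptom: the consistency check itself
        brings the system down when the block map is already inconsistent.
        """
        if not self.in_use[blkno]:
            raise AssertionError(f"block {blkno} is already free")
        self.in_use[blkno] = False

    def free_tolerant(self, blkno):
        """Free a block, reporting (not aborting on) a double free."""
        if not self.in_use[blkno]:
            return False  # caller can log this and schedule a repair
        self.in_use[blkno] = False
        return True
```

The tolerant variant only illustrates the design trade-off: crashing stops further damage early, while reporting keeps the system up at the risk of running on a corrupt map.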
From what I've heard, 'wack' and 'fastwack' aren't always the cure-all. Kind of disappointing to see that happen after a 20-hour wack run.
matto
P.S.: NATHAN SAYS "HI"!!##@!%
--matt@snark.net---------------------------------------------<darwin>< Matt Ghali MG406/GM023JP tokyo refugee - system admin - pop-tart fan www.hello-kitty.net "WWW my testicles!" - Bob Allisat, net.kook
I've been bit by 1549 as well, and had to run wack. One thing I've noticed is that whenever I've had to run wack (too often), it finds about a bazillion things to fix. What I'd like to see is a version of wack that I can run from the console, when the filer is up, in a mode similar to "fsck -n". This way I could periodically check my filesystem's integrity, and perhaps schedule downtime to run wack before the corruption gets too great. Better yet would be for NetApp to fix the problems that cause the corruption in the first place. Failing that, though, I would prefer to know about corruption of my filesystem BEFORE I have a system crash.
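The kind of report-only check being asked for (in the spirit of "fsck -n") can be sketched roughly as follows. This is a toy Python model, assuming a simple block map plus a table of which blocks each file references. The function name `check_readonly` and both data structures are hypothetical, and nothing here reflects what wack actually does.

```python
from collections import Counter


def check_readonly(in_use, inodes):
    """Report inconsistencies without modifying anything.

    in_use: list of booleans, True if the block is marked allocated.
    inodes: dict mapping an inode number to the list of block numbers
            that file references.
    Returns a list of (block number, problem description) tuples.
    """
    # Count how many times each block is referenced across all files.
    refs = Counter(b for blocks in inodes.values() for b in blocks)

    problems = []
    for blkno, used in enumerate(in_use):
        count = refs.get(blkno, 0)
        if count and not used:
            problems.append((blkno, "referenced but marked free"))
        elif used and not count:
            problems.append((blkno, "marked in use but unreferenced"))
        if count > 1:
            problems.append((blkno, "referenced more than once"))
    return problems


if __name__ == "__main__":
    block_map = [True, False, True]
    files = {1: [0, 1], 2: [0]}
    for blkno, problem in check_readonly(block_map, files):
        print(f"block {blkno}: {problem}")
```

Because the function only reads its inputs, it could in principle run periodically, with the length of the problem list used to decide when to schedule downtime for a real repair.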
-ste
1999-05-17-19:50:58 Jim McCoy:
NetApp charges a large premium over the raw costs of the disks and hardware to hook them together compared to what I could rig up with a PC and a few raid controllers, [...]
Personally, I disagree with this statement; if that were what NetApps sold, I'd never buy or recommend 'em.
Rather, NetApps sell two major things I am happy to pay for. I like their filesystem WAFL and the way it refuses to degrade in performance no matter how you abuse it, and of course I love the way they integrate a humongous battery-backed write cache.
I'm continuing to harbor this hope and prayer that Reiserfs may deliver the functionality I most prize in WAFL. I sure hope it plays nice with Linux's RAID. Then all I'd need is a nice integration of a battery-backed RAM board and voila, cheap appliances.
-Bennett