Hey,
GDG> autosupport messages too. As I went back through the autosupport
GDG> logs that are e-mailed to me each week, I found that the problem
GDG> began approximately two weeks earlier. Every time the disk tried
GDG> to read a particular sector, an error message would appear in the
GDG> messages log indicating that such an event had occurred. Had I not
GDG> been busily working other issues, to the detriment of my filers,
GDG> I would have failed this disk at least a week prior.
My immediate question to Netapp in this case would be: why was the periodic disk scrubbing not sufficient to cause the failed sectors to be replaced (this had been going on for two weeks)? Why, upon detection of the block failure (after all, if a log message is generated, then the filer knows it happened), was the data not immediately reconstructed elsewhere and the disk blocks marked as unusable? A block-sized RAID reconstruction and re-write should be a trivial problem for the filer to solve. This is the kind of detail I'd expect a storage vendor to place a much higher priority on than having a Java GUI. This is a well-known way that disks fail, not some mysterious voodoo issue. I worry about what other well-known failure modes were left out until a later release of ONTAP.
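To illustrate why a block-sized repair is cheap in principle: in a single-parity RAID group, the unreadable block is simply the XOR of the corresponding blocks on the surviving disks plus parity. The sketch below shows only the arithmetic, with made-up names and toy block sizes; it is not ONTAP's implementation.

# Sketch of single-block reconstruction in a single-parity (RAID-4/5
# style) group.  Hypothetical names; this only shows the arithmetic.

def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def reconstruct_block(readable_blocks):
    """Recover the unreadable block from the remaining blocks plus parity.

    With single parity, parity = XOR of all data blocks, so the missing
    block is just the XOR of everything that can still be read.
    """
    return xor_blocks(readable_blocks)

# Toy example: a 4-disk group (3 data + 1 parity), 8-byte blocks.
d0 = bytes([1, 2, 3, 4, 5, 6, 7, 8])
d1 = bytes([9, 9, 9, 9, 9, 9, 9, 9])
d2 = bytes([0, 1, 0, 1, 0, 1, 0, 1])      # pretend this block is unreadable
parity = xor_blocks([d0, d1, d2])

recovered = reconstruct_block([d0, d1, parity])
assert recovered == d2

The filer could then rewrite the recovered block in place (letting the drive remap the bad sector) or mark the block unusable and write the data elsewhere, which is exactly the behaviour being asked for above.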
GDG> occurred. Had I not been busily working other issues, to the
GDG> detriment of my filers, I would have failed this disk at least
GDG> a week prior.
Had Netapp not been busily working other issues, to the detriment of you and your users' time and data, this disk would have failed itself, or the filer would have taken some other corrective action on its own. You should have your own automated methods for looking for problems (like a script which analyses the logs and reports problems back to you). Don't be afraid to turn into a nasty bastard in a situation like this. None of my users would hesitate for a moment, and that may be your last recourse for making sure that people understand the priority of certain kinds of issues.
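For what it's worth, even a very small watcher script covers the kind of automated checking described above. A minimal sketch, assuming the filer's /etc/messages (or a syslog-forwarded copy) is readable from an admin host; the path, address, and error patterns are placeholders to adapt, not anything ONTAP-specific:

#!/usr/bin/env python3
"""Minimal sketch of a log watcher for filer messages.

Assumptions (adjust for your site): the filer's /etc/messages is reachable
at LOG_PATH (e.g. via an NFS mount of the root volume or syslog
forwarding), and disk media errors show up as lines containing "disk"
plus an error keyword.  The patterns and addresses here are placeholders.
"""

import re
import smtplib
from email.message import EmailMessage

LOG_PATH = "/mnt/filer1/etc/messages"          # placeholder path
ADMIN = "admin@example.com"                    # placeholder address
PATTERN = re.compile(r"disk.*(error|fail|bad block|medium)", re.IGNORECASE)

def scan_log(path):
    """Return the log lines that look like disk problems."""
    hits = []
    with open(path, errors="replace") as log:
        for line in log:
            if PATTERN.search(line):
                hits.append(line.rstrip())
    return hits

def mail_report(lines):
    """Mail the suspicious lines to the admin via the local MTA."""
    msg = EmailMessage()
    msg["Subject"] = f"filer log watcher: {len(lines)} suspicious lines"
    msg["From"] = ADMIN
    msg["To"] = ADMIN
    msg.set_content("\n".join(lines))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    suspects = scan_log(LOG_PATH)
    if suspects:
        mail_report(suspects)

Run out of cron each night, something like this would have flagged the repeated sector errors on the first day instead of in the third week.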
I realize that I am being brutal to Netapp here, but that kind of failure would cost us more than twice what we have invested in our entire Netapp infrastructure in the time needed to rebuild the data. It gives me that cold, prickly, paranoid feeling about all the data we have on our filers. I also realize that there are other potential problems that could have caused a dual disk failure in one RAID group. This specific problem should have been dealt with more gracefully by the filer on its own. If it could not be, then your case alone should have been enough to put it on the 'Must Fix This Immediately!' list.
Rob
"You're just the little bundle of negative reinforcement I've been looking for." -Mr. Gone
"Robert L. Millner" wrote:
> Hey,
>
> GDG> autosupport messages too. As I went back through the autosupport
> GDG> logs that are e-mailed to me each week, I found that the problem
> GDG> began approximately two weeks earlier. Every time the disk tried
> GDG> to read a particular sector, an error message would appear in the
> GDG> messages log indicating that such an event had occurred. Had I not
> GDG> been busily working other issues, to the detriment of my filers,
> GDG> I would have failed this disk at least a week prior.
>
> My immediate question to Netapp in this case would be: why was the periodic disk scrubbing not sufficient to cause the failed sectors to be replaced (this had been going on for two weeks)? Why, upon detection of the block failure (after all, if a log message is generated, then the filer knows it happened), was the data not immediately reconstructed elsewhere and the disk blocks marked as unusable? A block-sized RAID reconstruction and re-write should be a trivial problem for the filer to solve. This is the kind of detail I'd expect a storage vendor to place a much higher priority on than having a Java GUI. This is a well-known way that disks fail, not some mysterious voodoo issue. I worry about what other well-known failure modes were left out until a later release of ONTAP.
First and foremost, I take complete responsibility for my filers. I said as much in my message to my management, which was forwarded up through the VP level. Having said that, I do agree with you. This disk should have been failed by the filer two weeks prior. We at TI are now pushing NetApp to be proactive in producing a stable product. I have administered NAFS 1300/1400, F300s, F500s, and F700s. With each new rendition of the hardware, I have found the reliability diminished. At the Customer Advisory Council I advised them to build a more stable system. Recently, representatives from Texas Instruments, Inc. again told NetApp to provide a more stable system. I, personally, am finding it harder to defend NetApp. Overall they have good service after the sale -- if you live in North America.
I still prefer to look at the autosupport messages. There is so much that I glean from these. The information is not just which disks to watch out for, but also whether the network interfaces are being over-run. I have begun writing a rather large complement of tools in Perl to make the job easier as the number of filers continues to grow. We are also looking at what other storage vendors can do for us. This is not just an issue with failures; the NetApp filers are no longer able to supply the power needed at peak times in our process.
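As an example of the kind of thing such tools do (the real ones are in Perl; Python is used here only to keep the sketches in one language), here is a minimal pass over saved autosupport text that flags interfaces reporting input errors. The interface naming and the "input errors" counter text are assumptions -- the exact autosupport format varies by ONTAP release, so treat the pattern as a starting point rather than a parser:

#!/usr/bin/env python3
"""Minimal sketch: flag possible interface over-runs in saved autosupport
text.  The interface names (e0, e4a, ...) and the "input errors" counter
wording are assumptions about the report format, not a documented layout.
"""

import re
import sys

COUNTER = re.compile(
    r"^\s*(?P<name>e\d+\w*)\b.*?(?P<errors>\d+)\s+input errors",
    re.IGNORECASE,
)

def flag_overruns(autosupport_file, threshold=0):
    """Print interfaces whose input-error count exceeds the threshold."""
    with open(autosupport_file, errors="replace") as f:
        for line in f:
            m = COUNTER.search(line)
            if m and int(m.group("errors")) > threshold:
                print(f"{autosupport_file}: {m.group('name')} "
                      f"reports {m.group('errors')} input errors")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        flag_overruns(path)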
I cannot disclose what fifteen hours of downtime cost Texas Instruments, Inc., as that is proprietary information, though we were fortunate that this was over a holiday weekend.
-gdg
--
---------------------------------------------------------------
G D Geen                              mailto:geen@ti.com
Texas Instruments                     Phone : (214)480.7896
System Administrator                  FAX   : (214)480.7676
---------------------------------------------------------------
Life is what happens while you're busy making other plans.
                                                    -J. Lennon