toasters May 2000

toasters@lists.teaparty.net

122 participants
148 discussions

Re: raid failure
by Robert L. Millner 03 May '00

03 May '00

"AT" == Aiello, Tony <Tony.Aiello(a)netapp.com>:
"GDG" == G D Geen <geen(a)msp.sc.ti.com>:

AT> As of the 5.3.2 release we added functionality to reassign blocks as they
AT> occur. Prior to that we enabled automatic reassignment features on the disk
AT> drives to do this. We found that the disks did not handle the reassignment
AT> in all cases we'd like so we took control of that function. That would be

The fact that whatever version this customer was running was not able to handle that common type of disk failure scares the hell out of me for what 5.3.5R2P1 will not be able to handle and how it will bite me when something breaks. Why didn't this make it into ONTAP before something as useless as the Java GUI (OK, that's a personal opinion; but try to convince me that it's worth more than the ability to appropriately handle disk failures)?

Netapp had years of experience from other vendors to draw from when it was formed. Part of that experience should have been comprehensive knowledge of the kinds of failures people see in the field.

GDG> First and foremost, I take complete responsibility for my filers. I did so
GDG> in my message to my management which was forwarded up through VP level.

Of course, and there's always more that can be done at everyone's end. There's a long list of things I'd like to both do and see Netapp do. Important problems that vendors missed in their own tests make me really worry for that vendor's Q/A and what a product will cost me later when something pathological happens (after all, that's part of the cost of running the box).

Nobody gets everything right immediately and that includes Netapp. Nobody's server is perfect for my environment. Everything is a balance of costs and some of those costs are how well it works on its own. What they miss has a lot to do with what I consider my mistake to have been. Was my mistake not buying an EMC or a Sun in the first place or was my mistake not noticing the autosupport message (for example)?

GDG> Having said that, I do agree with you. This disk should have been failed by the
GDG> filer two weeks prior. We at TI are now pushing NetApp to be proactive in

We have been discussing certain things we'd like to see in their product as well (in case anyone is curious I can elaborate more here).

GDG> I still prefer to look at the autosupport messages. There is so much that I
GDG> glean from these. The information is not just what disks to watch out for

Everything in the autosupport messages can be gleaned from commands you run on the filer and checking the logs. I'd be really interested in hearing what kinds of tools you are working on if you don't mind sharing that info.

GDG> I cannot disclose what fifteen hours of down time cost Texas Instruments,
GDG> Inc. as that is proprietary information though we were fortunate that this
GDG> was over a holiday weekend.

*shiver*

Rob


Data OnTap 5.3.5P2 and NDMP Backups
by Linn, Greg 03 May '00

03 May '00

An NDMP-based backup problem has been identified in our Data OnTap 5.3.5P2 patch release.

During development of Data OnTap 5.3.6, a deadlock condition was discovered which impacted the NDMP Java implementation. This condition typically manifests itself as an NDMP connection failure during resource depletion conditions. We believe it was first introduced in 5.3.5P2. The problem has been resolved in 5.3.5R2P2 and in the forthcoming 5.3.6 release.

If you are running 5.3.5P2, and are experiencing NDMP backup problems similar to the connection failure described above, we recommend that you upgrade to 5.3.5R2P2.

Greg Linn
Manager, NDMP Development
linn(a)netapp.com
408.822.3752 telephone
408.822.4457 fax


Re: raid failure
by Robert L. Millner 03 May '00

03 May '00

SL> filter out those hourly status messages. There's no way I can wade
SL> through those weekly emails because /etc/messages is usually about
SL> 5000 lines long. I'd probably forget to check the logs by hand, but

Swatch is another really good tool for doing this. It can be used to compact a large number of entries into a summary and weed out useless information.

Rob
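For anyone without Swatch to hand, the "compact and weed out" idea can be roughed out in a few lines of Python. This is only a sketch and is not how Swatch itself works: the `NOISE` pattern is a guess at what an hourly status line might contain, and the timestamp pattern assumes syslog-style lines such as `Sun Apr 30 04:38:50 MDT ...`. Adjust both to whatever your filer actually logs.

```python
import re
from collections import Counter

# Assumption: lines matching this pattern are routine noise to be dropped.
# The actual wording of hourly status messages may differ on your filer.
NOISE = re.compile(r"hourly", re.IGNORECASE)

# Leading timestamp such as "Sun Apr 30 04:38:50 MDT " (assumed format).
TIMESTAMP = re.compile(r"^\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \w+ ")


def summarize(lines):
    """Drop noise lines, strip timestamps, and count identical messages.

    Returns a list of (message, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or NOISE.search(line):
            continue
        # Removing the timestamp lets repeated messages collapse into one entry.
        counts[TIMESTAMP.sub("", line)] += 1
    return counts.most_common()
```

Passing an open file handle for a copy of /etc/messages to `summarize` would turn several thousand lines into a short list of distinct messages with counts, which is much easier to read in a weekly mail.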


RE: raid failure
by Aiello, Tony 03 May '00

03 May '00

Hello,

I don't see the reference to the version of OnTap used but perhaps I can relate some information.

As of the 5.3.2 release we added functionality to reassign blocks as they occur. Prior to that we enabled automatic reassignment features on the disk drives to do this. We found that the disks did not handle the reassignment in all cases we'd like so we took control of that function. That would be why you could see multiple reports of bad blocks showing up in subsequent scrubs. The disk did not do the reassignment and so this bad spot was left on the media.

As of 5.3.2 messages would appear to the effect of:

Sun Apr 30 04:38:50 MDT [isp2100_main]: Disk 5.14: sector 33601609 will be reassigned

Reassignment means the device uses a different piece of media to store information for some block address. Not all errors returned from a disk can be handled by a block reassignment; really, only those that come back as unrecoverable media errors can be repaired by performing a block-level reassignment. Should the reassignment fail for some reason, the disk is failed, as sector-wise errors can lead to large reliability issues.

Tony
--------------------------
Tony Aiello, Mgr. Storage Software
mailto:taiello@netapp.com
Ph:(408)822-6515

> -----Original Message-----
> From: Robert L. Millner [mailto:rmillner@transmeta.com]
> Sent: Wednesday, May 03, 2000 10:08 AM
> To: toasters
> Subject: Re: raid failure
>
> Hey,
>
> GDG> autosupport messages too. As I went back through the autosupport
> GDG> logs that are e-mailed to me each week, I found that the problem
> GDG> began approximately two weeks earlier. Every time the disk tried
> GDG> to read a particular sector of the disk, an error message would
> GDG> appear in the messages log indicating such an event had occurred. Had
> GDG> I not been busily working other issues, to the detriment of my
> GDG> filers, I would have failed this disk at least a week prior.
>
> My immediate question to Netapp in this case would be why was the
> periodic disk scrubbing not sufficient to cause the failed sectors to be
> replaced (this was going on for two weeks)? Why upon detection of the
> block failure (after all, if a log message is generated, then the filer
> knows it happened) was the data not immediately reconstructed elsewhere
> and the disk blocks marked as unusable? A block sized RAID
> reconstruction and re-write should be a trivial problem for the filer to
> solve. This is the kind of detail I'd expect a storage vendor to place
> a much higher priority on than having a java GUI. This is a well known
> way that disks fail; not some mysterious voodoo issue. I worry about
> what other well known failure modes were left out till a later release
> of ONTAP.
>
> GDG> occurred. Had I not been busily working other issues, to the
> GDG> detriment of my filers, I would have failed this disk at least
> GDG> a week prior.
>
> Had Netapp not been busily working other issues, to the detriment of you
> and your users' time and data, this disk would have failed itself or the
> filer would have taken some other corrective action on its own. You
> should have your own automated methods for looking for problems (like a
> script which analyses the logs and reports problems back to you). Don't
> be afraid to turn into a nasty bastard in a situation like this. None
> of my users would hesitate for a moment and that may be your last
> recourse to making sure that people understand the priority of certain
> kinds of issues.
>
> I realize that I am being brutal to Netapp here but that kind of failure
> would cost us more than twice what we have invested in our entire Netapp
> infrastructure in time to rebuild the data. It gives me that cold,
> prickly, paranoid feeling about all the data we have on our filers. I
> also realize that there are other potential problems that would have
> caused a dual disk failure in one raid group. This specific problem
> should have been dealt with more gracefully by the filer on its own. If
> it didn't, then your case alone should have been enough to put it on the
> 'Must Fix This Immediately!' list.
>
> Rob
>
> "You're just the little bundle of negative reinforcement I've
> been looking for." -Mr. Gone
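The reassignment message Tony quotes is regular enough to pick out of the messages log with a script, which is the kind of automated check suggested in the quoted message. The following is a minimal sketch: the regular expression is derived only from the single example line in this message, so other ONTAP releases may word it differently, and the `threshold` default is an arbitrary illustrative value rather than a NetApp recommendation.

```python
import re
from collections import defaultdict

# Pattern inferred from the one example line in the message above:
# "... Disk 5.14: sector 33601609 will be reassigned"
REASSIGN = re.compile(
    r"Disk (?P<disk>\S+): sector (?P<sector>\d+) will be reassigned"
)


def reassigned_sectors(lines):
    """Map each disk ID to the list of sectors reported for reassignment."""
    by_disk = defaultdict(list)
    for line in lines:
        match = REASSIGN.search(line)
        if match:
            by_disk[match.group("disk")].append(int(match.group("sector")))
    return dict(by_disk)


def suspect_disks(by_disk, threshold=3):
    """Return disks with at least `threshold` reassignment reports.

    The threshold is a made-up default; pick one that suits your site.
    """
    return sorted(
        disk for disk, sectors in by_disk.items() if len(sectors) >= threshold
    )
```

Run over a copy of /etc/messages, `reassigned_sectors` shows which disks are accumulating reports, and a sector that appears more than once for the same disk is worth a closer look, since repeated reports for one spot are what went unnoticed for two weeks in the case discussed here.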


Re: raid failure
by Jay Orr 03 May '00

03 May '00

On 3 May 2000, Mark D Fowle wrote:

> I have heard a few horror stories lately about netapps and multi-disk raid
> failures. Has anyone out there experienced this and what did you do for
> recovery? Were there any warnings? I have not had this happen and would
> like to do as much as possible to prevent it.

I'll say this much for these Hardy Beasts - we had our A/C die on us overnight once, and I came in to find our filer dead. We had to swap out ALL the parts on the filer to bring it back up (it's an F330 we've had a few years and the room was 90+ degrees). However, we didn't lose a drive!

Knock on wood, we've never had a two-disk failure. To me, this illustrates that the chances of a double drive failure are quite low. Also, I always look at the drive lights as I walk by to make sure I didn't miss a log message about a drive failure.

my $0.02...

-----------
Jay Orr
Systems Administrator
Fujitsu Nexion Inc.
St. Louis, MO


Re: raid failure
by Robert L. Millner 03 May '00

03 May '00

Hey,

GDG> autosupport messages too. As I went back through the autosupport
GDG> logs that are e-mailed to me each week, I found that the problem
GDG> began approximately two weeks earlier. Every time the disk tried
GDG> to read a particular sector of the disk, an error message would
GDG> appear in the messages log indicating such an event had occurred. Had
GDG> I not been busily working other issues, to the detriment of my
GDG> filers, I would have failed this disk at least a week prior.

My immediate question to Netapp in this case would be why was the periodic disk scrubbing not sufficient to cause the failed sectors to be replaced (this was going on for two weeks)? Why upon detection of the block failure (after all, if a log message is generated, then the filer knows it happened) was the data not immediately reconstructed elsewhere and the disk blocks marked as unusable? A block sized RAID reconstruction and re-write should be a trivial problem for the filer to solve. This is the kind of detail I'd expect a storage vendor to place a much higher priority on than having a java GUI. This is a well known way that disks fail; not some mysterious voodoo issue. I worry about what other well known failure modes were left out till a later release of ONTAP.

GDG> occurred. Had I not been busily working other issues, to the
GDG> detriment of my filers, I would have failed this disk at least
GDG> a week prior.

Had Netapp not been busily working other issues, to the detriment of you and your users' time and data, this disk would have failed itself or the filer would have taken some other corrective action on its own. You should have your own automated methods for looking for problems (like a script which analyses the logs and reports problems back to you). Don't be afraid to turn into a nasty bastard in a situation like this. None of my users would hesitate for a moment and that may be your last recourse to making sure that people understand the priority of certain kinds of issues.

I realize that I am being brutal to Netapp here but that kind of failure would cost us more than twice what we have invested in our entire Netapp infrastructure in time to rebuild the data. It gives me that cold, prickly, paranoid feeling about all the data we have on our filers. I also realize that there are other potential problems that would have caused a dual disk failure in one raid group. This specific problem should have been dealt with more gracefully by the filer on its own. If it didn't, then your case alone should have been enough to put it on the 'Must Fix This Immediately!' list.

Rob

"You're just the little bundle of negative reinforcement I've been looking for." -Mr. Gone


RE: raid failure
by Walters, Mike 03 May '00

03 May '00

Just another tool for your armoury against the unthinkable: SnapMirror. This gives you the ability to keep an asynchronous copy of your volumes on another filer with little overhead.

I won't bore you with detail (which you may already know), but you might want to have a scan down http://www.netapp.com/tech_library/3066.html for data protection strategies.

Cheers
Mike

> -----Original Message-----
> From: owner-dl-toasters(a)netapp.com
> [mailto:owner-dl-toasters@netapp.com] On Behalf Of Mark D Fowle
> Sent: 03 May 2000 12:00
> To: toasters
> Subject: raid failure
>
> I have heard a few horror stories lately about netapps and multi-disk raid
> failures. Has anyone out there experienced this and what did you do for
> recovery? Were there any warnings? I have not had this happen and would
> like to do as much as possible to prevent it.
>
> Thanks,
> ======================================================================
> Mark Fowle
> Caterpillar/BCP
> Cary North Carolina
> ======================================================================


RE: raid failure
by Mohler, Jeff 03 May '00

03 May '00

Absolutely, ONTAP will just stop in its tracks, and will be just fine when you resolve the underlying hardware problem.

-----Original Message-----
From: Mike Mueller [mailto:Mike.Mueller@jpl.nasa.gov]
Sent: Wednesday, May 03, 2000 9:38 AM
To: toasters(a)mathworks.com
Subject: Re: raid failure

What about the case when a whole shelf goes away at once (power is lost to the shelf for instance)? This seems more likely than a multiple disk failure. Is this recoverable?

-- Mike


RE: raid failure
by Mohler, Jeff 03 May '00

03 May '00

> I have. If the disk failure is simply a bad block on another drive, Netapp
> can sometimes produce a patch that will "skip" that block and allow you to
> continue to use the filesystem and then after it's done rebuilding from the
> first failed disk, you can fail and rebuild the second failed disk. But this
> can be time consuming since you'll have to recheck your filesystem for
> possible corruption, and even after you fix that you have some data files
> somewhere that are corrupted that you don't know about.

---

There are actual 'secret ninja' commands and tricks to do this with the OS a customer is currently running, which can be used to get around a severely abused drive or two in the case of an emergency rebuild.

I have had to use this with customers who, more often than not, have just relocated their toaster from one rack/building to another, which says a LOT about proper drive treatment. I've seen people really get slam-happy on replacing drives in the shelves as well, seeding their own future drive failure issues.

Bruce describes the situation rather nicely. BUT those secret commands and options are things you will get from Tech Support should you need them (truly a rare experience for TS to have to use them), and you won't get 'em from me. *grin*

PS: Renew your support, is your data worth it?

http://now.netapp.com


Re: raid failure
by Bruce Sterling Woodcock 03 May '00

03 May '00

----- Original Message -----
From: "Mark D Fowle" <Fowle_Mark_D(a)CAT.com>
To: "toasters" <toasters(a)mathworks.com>
Sent: Wednesday, May 03, 2000 4:00 AM
Subject: raid failure

> I have heard a few horror stories lately about netapps and
> multi-disk raid failures.

I have heard a lot more horror stories regarding single-disk failures in non-raid situations. That's the whole point of getting raid... to move that pain from the chance of a single disk failure to a double disk failure, thus reducing it substantially.

> Has anyone out there experienced this
> and what did you do for recovery ?

I have. If the disk failure is simply a bad block on another drive, Netapp can sometimes produce a patch that will "skip" that block and allow you to continue to use the filesystem and then after it's done rebuilding from the first failed disk, you can fail and rebuild the second failed disk. But this can be time consuming since you'll have to recheck your filesystem for possible corruption, and even after you fix that you have some data files somewhere that are corrupted that you don't know about.

Generally, in the double-disk failure case, like every other RAID vendor (unless you're running like a +1 configuration), you lose whatever is in that raid group/filesystem and you have to restore it from a mirrored copy or tape backup.

> Were there any warnings?

Back before RAID scrubbing, sometimes there were no warnings, because when 1 drive failed and the system went through every block on the other disks rebuilding the filesystem, it would hit a block that had gone bad some time before and yet had not been accessed and boom, you had a double disk failure.

Now that you have RAID scrubbing, all those blocks are checked every week or so, and you will get any indications of a possibly bad drive. Also the filer does provide warnings in advance if it's having trouble talking to a particular drive and you can fail it proactively and replace it without waiting for it to fail (and increasing the risk that another drive will fail during that time).

> I have not had this happen and would like to do as much
> as possible to prevent it.

There's not much to be done to prevent it... disks have a MTBF and the chance of two failures in the same time period is non-zero. The best thing to do is follow Netapp's directions regarding the operating environment and rack-mounting of your drives (so you don't have excessive vibration or temperatures or whatnot), watch your logs and fail drives if they appear to be going bad and replace them ASAP. Set your raid reconstruct speed high or disable access to that filesystem during reconstruction to make sure the reconstruction happens as quickly as possible to shorten the time window for a second disk failure. Make sure your RAID scrubs run periodically and check their results. Use smaller RAID group sizes and have frequent backups so that when a failure does occur it is as limited and painless as possible. And only use disk drives from Network Appliance.

Bruce
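Bruce's points about smaller RAID groups and faster reconstruction can be made concrete with a back-of-the-envelope calculation. This is a rough sketch under a simplifying assumption that is not from the thread: each surviving disk fails independently at a constant rate of 1/MTBF. It ignores the latent bad blocks discussed above, which scrubbing is meant to catch, so treat the result as a lower bound on the real risk. The function name and all numbers are illustrative.

```python
import math


def second_failure_probability(group_size, rebuild_hours, mtbf_hours):
    """Chance that at least one surviving disk fails before the rebuild ends.

    Assumes independent failures at a constant rate of 1/mtbf_hours per disk
    (an exponential model), which is a simplification.
    """
    survivors = group_size - 1  # one disk has already failed
    return 1.0 - math.exp(-survivors * rebuild_hours / mtbf_hours)


# Illustrative values only; substitute your own group size, rebuild time,
# and the MTBF from your drive vendor's data sheet.
large_group = second_failure_probability(14, 10, 500_000)
small_group = second_failure_probability(7, 10, 500_000)
fast_rebuild = second_failure_probability(14, 5, 500_000)
```

Under this model the risk scales roughly with the number of surviving disks multiplied by the rebuild time, so halving either one roughly halves the exposure. That is the arithmetic behind the advice to keep groups small and reconstruction fast.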
