I have lived through this harrowing experience. Yes, there were warning signs, but as I was busy performing other tasks I did not see them. The moral of my story is to read your weekly message logs and have good backups. I also file every weekly autosupport message. I have message logs from every one of my filers (we currently have twenty-one) dating back to 1996, when I started this position.
It began with a phone call in the middle of the night. On the other end was a co-worker confused as to why the filer was no longer serving data. She had determined that there was a failed disk, but little did she know what the next fifteen hours would bring. The failed disk was being reconstructed, but the reconstruction would never complete. That is when I got the phone call. By this time it was just after 10:00 PM, and I rushed into work. I immediately got the rundown on the problem she was seeing and the actions she had taken. From there I made a few stabs at getting the RAID to reconstruct. My attempts also failed, so now we had to call NetApp support. I made the priority one call and was soon contacted by a NetApp engineer.

After a couple of hours of poking around on the filer, we were able to determine the sequence of events leading up to the night's exercise in futility. Approximately two weeks prior, the filer had reported in the messages log that a disk in RAID group zero of volume zero, the root volume, had an unrecoverable read error, but it did not fail the disk. This night, now closer to morning, the system failed a different disk in RAID group zero of volume zero. See the connection? The net result was that the filer could not reconstruct the data because of the unrecoverable read error on the earlier disk; it could not make any sense of the data coming off the corrupted sector.

By this time our local support people had been awakened by the central support team and had joined us. We were starting to draw a crowd. It was closing in on 2:00 AM when we tried to reconstruct the disk with the bad sector manually, but after a couple of failed attempts even the NetApp engineer was beginning to sound depressed. Then came the question. The one that all data managers hate to hear: how good are your backups? To this question I replied, "We are about to find out."
Since our root volume had failed, the first step was to make a new root volume. I made /vol/vol1 our new root volume. Next we destroyed /vol/vol0 and removed all the troubled disk drives. With all of those disks now spares, we could create a new RAID group and a new /vol/vol0. It is important to note here that there must be an /etc directory on the volume *before* Veritas can begin its data restore. Once the volume was recreated, we began our restore of approximately 150 gigabytes of data, which took us well into the morning.
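For anyone who has not had to do this at the console, the steps looked roughly like the sketch below. This is from memory and is an outline, not a recipe: command syntax differs between Data ONTAP releases, the disk count on the vol create line is only a placeholder, and creating /etc from an NFS-mounted admin host is my assumption about the easiest way to satisfy the Veritas requirement.

  vol options vol1 root    # promote /vol/vol1; it becomes the root volume at the next boot
  reboot                   # boot the filer from the new root volume
  vol offline vol0         # take the damaged volume offline...
  vol destroy vol0         # ...and destroy it, returning its good disks to the spare pool
  vol create vol0 14       # rebuild /vol/vol0 from spares (14 is a placeholder disk count)

  # From an NFS-mounted admin host (mount point is hypothetical), make sure
  # /etc exists on the new volume before starting the Veritas restore:
  mkdir /mnt/filer/vol0/etc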
The NetApp engineer in California and I had a long conversation over the phone while the data restores were getting under way. He was certain that there should have been some sort of sign that this was about to happen. I assured him that I would look and would forward my results to our local support people, who are copied on the autosupport messages too. As I went back through the autosupport logs that are e-mailed to me each week, I found that the problem had begun approximately two weeks earlier. Every time the disk tried to read a particular sector, an error message would appear in the messages log indicating that such an event had occurred. Had I not been busily working other issues, to the detriment of my filers, I would have failed this disk at least a week prior.
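In hindsight, even a dumb grep over the saved logs would have flagged it. Something along these lines would do, with the caveat that the log path is made up and the exact error strings differ between Data ONTAP releases, so treat the pattern as a placeholder:

  # Count read/medium errors in each weekly log; a disk that shows up
  # week after week is telling you something.
  egrep -ci 'medium error|unrecoverable read' /var/log/filer/messages.week-*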
-gdg
Mark D Fowle wrote:
I have heard a few horror stories lately about netapps and multi-disk RAID failures. Has anyone out there experienced this, and what did you do for recovery? Were there any warnings? I have not had this happen and would like to do as much as possible to prevent it.
Thanks,
=======================================================================
Mark Fowle
Caterpillar/BCP
Cary, North Carolina
=======================================================================
--
---------------------------------------------------------------
G D Geen                         mailto:geen@ti.com
Texas Instruments                Phone : (214) 480.7896
System Administrator             FAX   : (214) 480.7676
---------------------------------------------------------------
Life is what happens while you're busy making other plans.  -J. Lennon
What about the case when a whole shelf goes away at once (power is lost to the shelf for instance)? This seems more likely than a multiple disk failure. Is this recoverable?
-- Mike
On Wed, 3 May 2000, Mike Mueller wrote:
What about the case when a whole shelf goes away at once (power is lost to the shelf for instance)? This seems more likely than a multiple disk failure. Is this recoverable?
Yes, loss of SCSI/FibreChannel connectivity or power to a shelf will cause the filer to halt or reboot (I forget exactly which... it's been a while since we did our failure mode testing). No data is lost. It's a good idea to keep a spare (empty) shelf and shelf power supplies on-site to minimize repair time should this ever happen.
On Wed, May 03, 2000 at 06:14:05PM -0400, Brian Tao wrote:
On Wed, 3 May 2000, Mike Mueller wrote:
What about the case when a whole shelf goes away at once (power is lost to the shelf for instance)? This seems more likely than a multiple disk failure. Is this recoverable?
Yes, loss of SCSI/FibreChannel connectivity or power to a shelf
will cause the filer to halt or reboot (I forget exactly which... it's been a while since we did our failure mode testing). No data is lost.
halt, which makes sense, since if the shelf was still powered down, or the loop broken, it would have to repeat the same cycle on reboot.
-s
On Sat, 6 May 2000, Steve Armijo wrote:
On Wed, May 03, 2000 at 06:14:05PM -0400, Brian Tao wrote:
Yes, loss of SCSI/FibreChannel connectivity or power to a shelf
will cause the filer to halt or reboot (I forget exactly which... it's been a while since we did our failure mode testing). No data is lost.
halt, which makes sense, since if the shelf was still powered down, or the loop broken, it would have to repeat the same cycle on reboot.
That's what I thought at first, but I had the notion that the filer reboots on the first try and then halts if it encounters the same error condition (to go with its "let's do a quick reboot and see if that wakes anything up" philosophy). Perhaps not. There was a case I ran across a few years ago where the filer would only try rebooting itself 5 times in a row to clear a problem, and then would give up and halt.
I have lived through this harrowing experience. Yes, there were warning signs, but as I was busy performing other tasks I did not see them. The moral of my story is to read your weekly message logs and have good backups. I also file every weekly autosupport message. I have message logs from every one of my filers (we currently have twenty-one) dating back to 1996, when I started this position.
I have a nightly cron job that emails me the interesting lines from /etc/messages. Our filer gets tons of disk quota exceeded messages because we have over 20,000 users and each one has a quota. And I also filter out those hourly status messages. There's no way I can wade through those weekly emails because /etc/messages is usually about 5000 lines long. I'd probably forget to check the logs by hand, but I always read my email.
This is roughly how my cron script emails me:
MAILTO=mailid@mailhost
(
  echo "Subject: netapp daily log"
  echo
  egrep -v 'NFS ops|quota exceeded|message repeated' /na/etc/messages
) | /usr/lib/sendmail $MAILTO
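The crontab entry that drives it is just the usual nightly one-liner; the script path and run time below are made up, so adjust them to whatever fits your site:

  # Summarize yesterday's filer log every morning at 06:00.
  0 6 * * * /usr/local/adm/filer-log-mail.sh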
Steve Losen scl@virginia.edu phone: 804-924-0640
University of Virginia ITC Unix Support
I have lived through this harrowing experience. Yes, there were warning signs, but as I was busy performing other tasks I did not see them. The moral of my story is to read your weekly message logs and have good backups. I also file every weekly autosupport message.
We have tools within NetApp which run early each morning to analyze the autosupports we receive. I believe some types of problems (a disk, fan, or power supply failure) will already automatically trigger the creation of a support case and, where possible, notification to the customer so that corrective action can be taken if spares are on site. Other problems aren't automated yet. We're improving parts of the analysis at a rapid pace, so many errors which aren't yet caught will be dealt with in the near future.
Unfortunately, this only works if we get the autosupport in the first place. Apparently there are a number of customers who think we don't want to be bothered. We do! If your filer or NetCache doesn't list 'autosupport@netapp.com' in the 'autosupport.to' option, please add it if your policies allow this.
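On the filer console that is just the usual options command; the syntax below is from memory, and the second address is only an example of keeping yourself on the recipient list, so check the options man page for your release:

  options autosupport.to    # with no value, shows the current recipient list
  options autosupport.to autosupport@netapp.com,admin@example.com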
We understand that some customers can't take advantage of this due to security or confidentiality concerns. A longer-term goal is to wrap the analysis tools in a package that customers can run themselves.
--
Karl Swartz
Network Appliance Engineering
Work: kls@netapp.com   http://www.netapp.com/
Home: kls@chicago.com  http://www.chicago.com/~kls/