I have lived through this harrowing experience. Yes, there were warning signs, but as I was busy performing other tasks I did not see them. The moral of my story is to read your weekly message logs and have good backups. I also file every weekly autosupport message. I have message logs from every one of my filers (we currently have twenty-one), dating back to 1996 when I started this position.
It began with a phone call in the middle of the night. On the other end was a co-worker confused as to why the filer was no longer serving data. She had determined that there was a failed disk, but little did she know what the next fifteen hours would bring. The filer was reconstructing the reported failed disk, but the reconstruction would never complete. That is when I got the phone call. By this time it was just after 10:00 PM and I rushed into work. I immediately got the rundown on the problem she was seeing and what actions had been taken. From that point I made a few stabs at getting the RAID to reconstruct. My attempts also failed, so now we had to call NetApp support. I made the priority one call and was soon contacted by a NetApp engineer.

After a couple of hours poking around on the filer, we were able to determine the sequence of events leading up to the night's exercises in futility. Approximately two weeks prior, the filer had reported in the messages logs that a disk in RAID group zero on volume zero, the root volume, had an unrecoverable read error, but it did not fail the disk. This night, now closer to morning, the system failed a different disk in RAID group zero on volume zero. See the connection? The net result was that the filer could not reconstruct the data because of the unrecoverable read error on the earlier disk. The filer could not make any sense of the data coming off the corrupted sector of the disk.

By this time our local support people had been awakened by the central support team and had joined us. We were starting to draw a crowd. It was closing in on 2:00 AM when we tried to manually reconstruct the disk with the bad read sector, but after a couple of failed attempts even the NetApp engineer was beginning to sound depressed. Then came the question. The one that all data managers hate to hear. "How good are your backups?" To this question I replied, "We are about to find out."
Since our root volume had failed, the first step was to make a new root volume. I made /vol/vol1 our new root volume. Next we destroyed /vol/vol0 and removed all the troubled disk drives. Since all of these disks were now spares, we could create a new RAID group and a new /vol/vol0. Here it is important to note that there must be an /etc directory on the volume *before* Veritas can begin its data restore. Once the volume was recreated, we began our restore of approximately 150 gigabytes of data, which took us well into the morning.
The NetApp engineer in California and I had a long conversation over the phone while the data restores were beginning. He was certain that there should have been some sort of sign that this was about to happen. I assured him that I would look and would forward my results to our local support people, who are copied on the autosupport messages too. As I went back through the autosupport logs that are e-mailed to me each week, I found that the problem began approximately two weeks earlier. Every time the filer tried to read a particular sector of the disk, an error message would appear in the messages log indicating that such an event had occurred. Had I not been busily working other issues, to the detriment of my filers, I would have failed this disk at least a week prior.
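For anyone who wants a second pair of eyes on those weekly logs, here is a rough sketch of the kind of check I wish I had been running. It counts lines that mention a read error, grouped by disk. The match strings are assumptions on my part, not the filer's actual message format: it assumes an error line contains the words "read error" and names the disk as "disk <id>". The function names are made up for illustration, so adjust the patterns to whatever your own messages log really prints.

```python
import re
import sys
from collections import Counter

# ASSUMPTION: these patterns are illustrative only. They are not taken from
# real filer output; edit them to match the lines in your own messages log.
ERROR_RE = re.compile(r"read error", re.IGNORECASE)
DISK_RE = re.compile(r"\bdisk\s+(\S+)", re.IGNORECASE)


def count_read_errors(lines):
    """Return a Counter mapping disk id -> number of read-error lines."""
    counts = Counter()
    for line in lines:
        if not ERROR_RE.search(line):
            continue
        match = DISK_RE.search(line)
        # Strip trailing punctuation such as "0.3:" -> "0.3".
        disk = match.group(1).rstrip(":,.") if match else "unknown"
        counts[disk] += 1
    return counts


def disks_to_review(lines, threshold=1):
    """Return sorted disk ids with at least `threshold` read-error lines."""
    counts = count_read_errors(lines)
    return sorted(disk for disk, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Usage: python check_logs.py messages.0 messages.1 ...
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for disk, n in count_read_errors(fh).most_common():
                print(f"{path}: disk {disk}: {n} read error line(s)")
```

Run it against each saved weekly log. Any disk that shows up more than once is worth a closer look, and a call to support, before a second disk in the same RAID group fails.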
-gdg
Mark D Fowle wrote:
I have heard a few horror stories lately about netapps and multi-disk raid failures. Has anyone out there experienced this, and what did you do for recovery? Were there any warnings? I have not had this happen and would like to do as much as possible to prevent it.
Thanks,
======= Mark Fowle Caterpillar/BCP Cary North Carolina =============================================================================== =======
--
---------------------------------------------------------------
G D Geen                      mailto:geen@ti.com
Texas Instruments             Phone : (214)480.7896
System Administrator          FAX   : (214)480.7676
---------------------------------------------------------------
Life is what happens while you're busy making other plans.
                                                    -J. Lennon