Brian Tao wrote:
Does anyone else find the wording of syslog messages confusing? I
almost had a small heart attack when I saw this after failing out a marginal drive, thinking that two drives had just died. "Disk 8" and "Data disk 7" are the same thing, and I wish the filer would use one or the other.
Mon Aug 4 15:45:27 EDT [raid_disk_admin]: Unload of disk 8 has completed successfully.
Mon Aug 4 15:45:27 EDT [raid_stripe_owner]: Read on data disk 7 failed, reverting to degraded mode.
Admittedly, this is confusing. I hope your heart is feeling better now.
Just one clarification. The netapp will not "fail" two drives. If it has a "read failure" on a drive while it is in degraded mode, and all retries fail, it will reboot to try to "jar" the offending SCSI device back to life, but will not actually "fail" it in the RAID sense.
In the worst-case scenario, i.e. a double disk failure, a system in degraded mode cannot complete reconstruction, because the errors from the second "failing" drive keep causing reboots before reconstruction finishes.
This behavior is actually useful for rescuing data, because NFS is stateless. Clients can still access the data, with the occasional "NFS server not responding" message during the reboots.
That is basically what a "double disk failure" is. It does not mean that suddenly you cannot access any of your data. You can choose to use "rdist" or some other file copy over NFS to copy the data off, or decide that your backups are reliable and restore onto a new filesystem. The recovery might even be able to wait until the weekend, with reboots every hour or so in the meantime.
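For the "some other file copy over NFS" option, here is a rough sketch of what a reboot-tolerant copy could look like. It is only an illustration, not anything the filer ships with: the function name copy_tree_with_retries, the retry count, and the delay are all made up for this example. On a hard-mounted filesystem the client normally just blocks until the server comes back, so the retry loop matters mainly when the mount is soft and I/O errors are returned to the application.

```python
import os
import shutil
import time


def copy_tree_with_retries(src_root, dst_root, retries=5, delay_seconds=60.0,
                           copy=shutil.copy2, sleep=time.sleep):
    """Copy every file under src_root into dst_root, retrying on OSError.

    Hypothetical helper for illustration. Returns the list of source paths
    (files or unreadable directories) that could not be copied.
    """
    failed = []
    # os.walk silently skips unreadable directories unless onerror is given,
    # so record those directories as failures too.
    walker = os.walk(src_root, onerror=lambda err: failed.append(err.filename))
    for dirpath, _dirnames, filenames in walker:
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            for attempt in range(1, retries + 1):
                try:
                    copy(src, dst)
                    break  # this file is done
                except OSError:
                    if attempt == retries:
                        failed.append(src)  # give up on this file
                    else:
                        sleep(delay_seconds)  # wait out a reboot, then retry
    return failed
```

Anything left in the returned list can be retried later or pulled from backups instead.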
So basically, give your heart a rest, and don't panic if a message appears. This is a very, very *VERY* rare occurrence.
Ken.