I have recently seen a surge in the number of disk failures on an F760 cluster in our datacenter. We have four F760s purchased a few months apart; one cluster has seen 12 disks fail in the past three months while the other cluster has seen two disks fail (I can only actually remember one, but I'm allowing for my failing memory as well).
The clusters sit only a few feet apart so I am discounting environmental problems. Both clusters are running NetApp Release 6.1R1P1. The "bad" cluster is using mostly Seagate ST318203FC 18 GB disks, while the "good" cluster is mostly the Seagate ST118202FC drives (the infamous spin-up problem disks). The good cluster is unbalanced; one head has 52 disks and the other has 32. The bad cluster is evenly balanced (or it was before we started losing disks en masse) with 42 on each side. Both clusters are running disk firmware NA10 for the ST318203FC disks (I know, I just discovered that it's one rev out of date) and NA27 for the ST118202FC disks.
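To put the numbers above side by side, here is a rough Python sketch of the arithmetic. It uses only the disk and failure counts quoted in this post. It assumes each cluster's disk population stayed at its stated total (failed disks replaced), and the annualized figure is a naive linear extrapolation from three months, not a real reliability estimate. The variable and function names are mine, purely for illustration.

```python
# Disk totals per cluster, from the head counts quoted in the post:
# "bad" cluster is 42 + 42, "good" cluster is 52 + 32.
TOTAL_DISKS = {"bad": 42 + 42, "good": 52 + 32}

# Failures over the past three months, as quoted in the post.
FAILED_IN_3_MONTHS = {"bad": 12, "good": 2}


def failure_fraction(failed: int, total: int) -> float:
    """Fraction of the disk population that failed in the window."""
    return failed / total


def annualized(fraction_3_months: float) -> float:
    """Naive linear extrapolation of a three-month fraction to a year.

    Assumption: failures keep arriving at the same rate, which may not hold.
    """
    return fraction_3_months * 4


if __name__ == "__main__":
    for name in ("bad", "good"):
        frac = failure_fraction(FAILED_IN_3_MONTHS[name], TOTAL_DISKS[name])
        print(f"{name}: {TOTAL_DISKS[name]} disks, "
              f"{frac:.1%} lost in 3 months, "
              f"~{annualized(frac):.1%} per year if the rate holds")
```

Both clusters come out at 84 disks, so the raw failure counts (12 versus 2) are directly comparable, and the difference is a factor of six.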
The strangest part of this whole situation is that the disks rarely fail; they disappear and the partner complains that there is a cluster mismatch, breaks clustering, and sends out an email. The filer with the missing disk starts to rebuild (if it was a data disk) or merrily goes on its way (if it was a spare), but nothing ever shows up as broken. The disk just disappears.
Short of going through and replacing every piece of hardware in the "bad" filers, I am at a loss as to how to proceed. I've spent the morning searching NOW without luck. [Someone just pointed out to me that we have a few X221_ST318304FC disks with NA06 firmware in several of our filers, not just the good and bad clusters I've been describing, opening us up to bug 27068 (we're trying to schedule downtime to upgrade the firmware on all our filers now).] I am going to try upgrading the disk firmware on the filers as a first step, but if anyone else has seen this problem, or something similar, I would appreciate any input.
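As a way to keep track of which disks need attention before the firmware upgrade, something like the following Python sketch could flag the model/firmware combinations mentioned in this post. The inventory entries and filer names are hypothetical placeholders, and the only "known" pairs are the two called out above (NA06 on ST318304FC, and NA10 on ST318203FC being one revision behind). It does not talk to a filer; the inventory would have to be gathered by hand or by some other means.

```python
from typing import NamedTuple


class Disk(NamedTuple):
    filer: str      # hypothetical filer name
    model: str      # model string, possibly with a part-number prefix
    firmware: str   # disk firmware revision


# (base model, firmware) pairs called out in the post, with the reason.
FLAGGED = {
    ("ST318304FC", "NA06"): "exposed to bug 27068",
    ("ST318203FC", "NA10"): "one revision out of date",
}


def base_model(model: str) -> str:
    """Drop a prefix such as 'X221_' so 'X221_ST318304FC' matches 'ST318304FC'."""
    return model.split("_")[-1]


def flag_disks(inventory):
    """Return (disk, reason) for every disk matching a flagged pair."""
    hits = []
    for disk in inventory:
        reason = FLAGGED.get((base_model(disk.model), disk.firmware))
        if reason is not None:
            hits.append((disk, reason))
    return hits


# Hypothetical inventory, for illustration only.
SAMPLE_INVENTORY = [
    Disk("filer-a", "X221_ST318304FC", "NA06"),
    Disk("filer-a", "ST318203FC", "NA10"),
    Disk("filer-b", "ST118202FC", "NA27"),
]

if __name__ == "__main__":
    for disk, reason in flag_disks(SAMPLE_INVENTORY):
        print(f"{disk.filer}: {disk.model} {disk.firmware} -> {reason}")
```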
Geoff Hardin
geoff.hardin@dalsemi.com
If it's glowing, don't eat it...
This sounds more like a general Fibre Channel error, perhaps from a bad LRC, cable, or card. You should open a ticket with Network Appliance, or at the very least boot into maintenance mode and run some of the detailed Fibre Channel tests from the 1-5 menu.
On Wed, 24 Jul 2002, Geoff Hardin wrote:
| I have recently seen surge in the number of disk failures on an F760
| cluster in our datacenter. We have four F760's purchased a few months
| apart; one cluster has seen 12 disks fail in the past three months while
| the other cluster has seen two disks fail (I can only actually remember
| one, but I'm allowing for my failing memory as well).
|
| The clusters sit only a few feet apart so I am discounting
| environmental problems. Both clusters are running NetApp Release
| 6.1R1P1. The "bad" cluster is using mostly Seagate ST318203FC 18 GB
| disks, while the "good" cluster is mostly the Seagate ST118202FC drives
| (the infamous spin-up problem disks). The good cluster is unbalanced;
| one head has 52 disks and the other has 32. The bad cluster is evenly
| balanced (or it was before we started losing disks en masse) with 42 on
| each side. Both clusters are running disk firmware NA10 for the
| ST318203FC disks (I know, I just discovered that it's one rev out of
| date) and NA27 for the ST118202FC disks.
|
| The strangest part of this whole situation is that the disks rarely
| fail; they disappear and the partner complains that there is a cluster
| mismatch, breaks clustering, and sends out an email. The filer with the
| missing disk starts to rebuild (if it was a data disk) or merrily goes
| on its way (if it was a spare), but nothing ever shows up as broken.
| The disk just disappears.
|
| Short of going through and replacing every piece of hardware in the
| "bad" filers, I am at a loss of how to proceed. I've spent the morning
| searching NOW without luck. [Someone just pointed out to me that we
| have a few X221_ST318304FC disks with NA06 firmware in several of our
| filers, not just the good and bad clusters I've been describing, opening
| us up to bug 27068 (we're trying to schedule downtime to upgrade the
| firmware on all our filers now).] I am going to try upgrading the disk
| firmware on the filers as a first step, but if anyone else has seen this
| problem, or something similar, I would appreciate any input.
|
| Geoff Hardin
| geoff.hardin@dalsemi.com
| If it's glowing, don't eat it...
|