rampant disk failure - toasters

24 Jul 2002


      I have recently seen a surge in the number of disk failures on an F760
cluster in our datacenter.  We have four F760s purchased a few months
apart; one cluster has seen 12 disks fail in the past three months while
the other cluster has seen two disks fail (I can only actually remember
one, but I'm allowing for my failing memory as well).

The clusters sit only a few feet apart so I am discounting
environmental problems.  Both clusters are running NetApp Release
6.1R1P1.  The "bad" cluster is using mostly Seagate ST318203FC 18 GB
disks, while the "good" cluster is mostly the Seagate ST118202FC drives
(the infamous spin-up problem disks).  The good cluster is unbalanced;
one head has 52 disks and the other has 32.  The bad cluster is evenly
balanced (or it was before we started losing disks en masse) with 42 on
each side.  Both clusters are running disk firmware NA10 for the
ST318203FC disks (I know, I just discovered that it's one rev out of
date) and NA27 for the ST118202FC disks.

The strangest part of this whole situation is that the disks rarely
fail; they disappear and the partner complains that there is a cluster
mismatch, breaks clustering, and sends out an email.  The filer with the
missing disk starts to rebuild (if it was a data disk) or merrily goes
on its way (if it was a spare), but nothing ever shows up as broken. 
The disk just disappears.

Short of going through and replacing every piece of hardware in the
"bad" filers, I am at a loss as to how to proceed.  I've spent the morning
searching NOW without luck.  [Someone just pointed out to me that we
have a few X221_ST318304FC disks with NA06 firmware in several of our
filers, not just the good and bad clusters I've been describing, opening
us up to bug 27068 (we're trying to schedule downtime to upgrade the
firmware on all our filers now).]  I am going to try upgrading the disk
firmware on the filers as a first step, but if anyone else has seen this
problem, or something similar, I would appreciate any input.

Geoff Hardin
geoff.hardin@dalsemi.com
If it's glowing, don't eat it...