Last week we had a Fibre Channel F760 completely crash and burn. Right now the system is in a state where it will serve data for 10 or 15 minutes, and then it reports inconsistent parity errors and crashes. We will be sending it to NetApp for failure analysis.
That was a nightmare, to say the least - and, of course, it happened while I was on vacation. (My first thought was that somebody was mad at me for flaming about the disk firmware thing and decided to take it out on me personally - yikes - I swear off flaming for the rest of my life.) Anyway, last night another of our filers crashed with the same type of errors. This is an F630 with FC disks. I called NetApp - just got off the phone with them - and they are looking into it, but before I lose another filer I thought I'd bounce this one off the user community.
These are the types of errors it spits out:
Sun Feb 21 04:22:00 MST [isp2100_main]: isp2100_error_proc: dev 8.11 (0.11) data underrun (cmd opcode 0x28) ( 0x0 0x5c28 0x8000 )
Sun Feb 21 04:22:00 MST [isp2100_main]: Disk 8.11: Data underrun (0xfffffc0000c204f0,0x28,0)
A few of those, then a few of these:
Sun Feb 21 04:22:00 MST [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 2, stripe #377824.
Sun Feb 21 04:22:00 MST [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 2, stripe #377824.
Then:
Sun Feb 21 04:23:22 MST [edm_admin]: No valid paths to Enclosure Services in shelf 2 on ha 8.
Sun Feb 21 04:23:22 MST [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 2, stripe #377861.
Sun Feb 21 04:23:22 MST [raid_stripe_owner]: Out of messages; cannot dump contents of inconsistent stripe.
Sun Feb 21 04:24:00 MST [edm_admin]: No valid paths to Enclosure Services in shelf 3 on ha 8.
Sun Feb 21 11:30:36 GMT [rc]: de_main: e10 : Link up.
Sun Feb 21 04:30:37 MST [rc]: saving 44M to /etc/crash/core.0.nz ("wafl_check_vbns: vbn too big")
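For anyone who wants to compare their own messages file against these excerpts, here is a rough Python sketch that tallies entries per bracketed tag and pulls out the stripes reported with inconsistent parity. It is only a sketch: the line format is guessed from the excerpts above, and the function names (join_wrapped, summarize) are made up for illustration, not anything that ships with the filer. It re-joins entries that got wrapped across two lines before matching.

```python
import re
from collections import Counter

# Start of an entry, e.g. "Sun Feb 21 04:22:00 MST [isp2100_main]: ..."
# (format inferred from the excerpts in this post; an assumption)
ENTRY_RE = re.compile(
    r"^\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \w+ \[(?P<tag>[^\]]+)\]: (?P<msg>.*)$"
)
PARITY_RE = re.compile(
    r"Inconsistent parity on volume (\S+), RAID group (\d+), stripe #(\d+)"
)


def join_wrapped(lines):
    """Glue continuation lines (no leading timestamp) onto the previous entry."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if ENTRY_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def summarize(lines):
    """Return (entries per tag, list of (volume, raid_group, stripe)) tuples."""
    tags = Counter()
    stripes = []
    for entry in join_wrapped(lines):
        m = ENTRY_RE.match(entry)
        if not m:
            continue  # not a recognisable log entry
        tags[m.group("tag")] += 1
        p = PARITY_RE.search(m.group("msg"))
        if p:
            stripes.append((p.group(1), int(p.group(2)), int(p.group(3))))
    return tags, stripes


sample = """\
Sun Feb 21 04:22:00 MST [isp2100_main]: Disk 8.11: Data underrun
(0xfffffc0000c204f0,0x28,0)
Sun Feb 21 04:22:00 MST [raid_stripe_owner]: Inconsistent parity on
volume vol0, RAID group 2, stripe #377824.
Sun Feb 21 04:23:22 MST [edm_admin]: No valid paths to Enclosure
Services in shelf 2 on ha 8.
""".splitlines()

tags, stripes = summarize(sample)
print(tags)
print(stripes)
```

Run against a full messages file, the per-tag counts and the order of the stripe numbers might help show whether the underruns always come before the parity complaints.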
NetApp thinks the previous failure may have been due to heat in the computer room - BUT - the filer *never* spit out any thermal warnings. I'm not too sure I believe that heat was the culprit - I can run it for hours doing disk scrubs, but as soon as I use it for NIS it crashes. This system (the F630) is in a different room and has a better-ventilated cabinet - again, no thermal warnings in the messages file - but maybe the thermal warning messages don't work? Just out of curiosity - has anyone ever seen an FC system spit out a thermal warning?? (5.1.2P2)
Has anyone experienced these types of problems with FC filers? I've never had a problem with SCSI filers - but these FC ones just seem flaky to me....
Thanks for any insight anyone can lend...
Graham