Re: Fiber Channel filer failures

22 Feb 1999


      Just to expand on what Steve Gremban has given you.  We have seen the
same errors.  Network Appliance has not given me any indication that they
think the problem is heat related.  As a matter of fact, I got a bug
report number, 1549, and was asked to run "wackz" on the filer to
fix the parity inconsistencies.  From the beginning, I was told that this
was a hardware problem and the fix was to run "wackz".  We had two of the
core files analyzed by the NetApp system engineers to confirm their findings.
I have not yet talked to the system engineer who performed the analysis,
but I did talk with the support line people, who were able to read the
resolution status to me over the phone.  I will ask that the status be
e-mailed to me on Monday morning.  Once I receive the report, I will forward
a summary on to you.

Bug ID: 1549

Title:
System crashes with the "write_alloc.c:xxx: Assertion failure" message.

Problem Description:
When the filer removes a file, it tries to resize the filer's inode
and free a block associated with the inode. However, when
the filer is verifying whether the block to be freed is part
of the active file system, if the block is already free, the
filer crashes.

Workaround:
Run the wack utility on the filer. For more information
about wack, contact Network Appliance Technical Support.

The thermometer/humidity gauge was in the cabinet next to the filer that
reported the problem.  Since it was placed in that cabinet, the min. temp
was 85 deg. F and the max was 91 deg. F.  The humidity ranged from 26 to 56%.
I have moved the thermometer into the cabinet housing the filer we are
currently having problems with and have reset the min/max.

If you are still concerned about temp., get a temp/humidity gauge from
Radio Shack; they cost about $20.

If you have not done so already, I strongly suggest that you run "wackz" on
your filer as soon as possible and contact your NetApp support.
-gdg
...
Graham,
We just started seeing the same problems with one of our F760s after 45
days of uptime. It first happened last week, on 2/14, during a scrub.
The filer rebooted and recovered OK. We opened a call (#52749) with NetApp.
It happened again on 2/19 and twice on 2/20. On 2/21 the filer booted but was
not able to recover. NetApp had us run "wackz" to recover.
There are no thermal warnings in the messages file. Tomorrow we will put a
thermometer in the cabinet to check it out.
-Steve        gremban@ti.com
Here are some excerpts from the system notification message:
===== SYSCONFIG-V =====
        NetApp Release 5.1.1: Sat Aug 1 10:22:36 PDT 1998
        System ID: 0016786020 (regina)
        slot 0: System Board (NetApp System Board V H0)
                Model Name:         F760
                Serial Number:      300258
                Firmware release:   2.0_a2
                Memory Size:        1024 MB
        slot 0: FC adapter:     isp2100 (chip rev. 3)
                Firmware rev:   1.12
                Host Loop Id:   119
                Cacheline size: 8       FC Packet size: 2048
                0: SEAGATE ST19171FC       0017 Size=8.6GB (17783112 blocks)
                1: SEAGATE ST19171FC       0017 Size=8.6GB (17783112 blocks)


...
        slot 8: FC adapter:     isp2100 (chip rev. 3)
                Firmware rev:   1.12
                Host Loop Id:   119
                Cacheline size: 8       FC Packet size: 2048
                0: SEAGATE ST19171FC       0017 Size=8.6GB (17783112 blocks)
                1: SEAGATE ST19171FC       0017 Size=8.6GB (17783112 blocks)


...
===== MESSAGES =====
Sun Feb 14 00:00:09 CST [asup_main]: System Notification mail sent
Sun Feb 14 01:00:01 CST [statd]:   1:00am up 45 days, 17:14 236884596 NFS
ops, 0 CIFS ops, 0 HTTP ops
Sun Feb 14 01:00:01 CST [raid_scrub_admin]: Beginning disk scrubbing...
Sun Feb 14 02:00:01 CST [statd]:   2:00am up 45 days, 18:14 236888295 NFS
ops, 0 CIFS ops, 0 HTTP ops
Sun Feb 14 02:46:26 CST [isp2100_timeout]: 0a.4 (cmd opcode 0x28, retry
count 0): command timeout, aborting request
Sun Feb 14 02:46:31 CST [isp2100_timeout]: isp2100_reset_device: 0a.4 (0.4)
failed
Sun Feb 14 02:46:31 CST [isp2100_timeout]: Resetting ISP2100 0a (ha #0)
Sun Feb 14 03:00:01 CST [statd]:   3:00am up 45 days, 19:14 236889166 NFS
ops, 0 CIFS ops, 0 HTTP ops
Sun Feb 14 03:25:13 CST [isp2100_main]: isp2100_error_proc: dev 0a.9 (0.9)
data underrun (cmd opcode 0x28) ( 0x0 0x5c28 0x0 )
Sun Feb 14 03:25:13 CST [isp2100_main]: Disk 0a.9: Data underrun
(0xfffffc0000c2a7f0,0x28,0)
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Inconsistent parity on volume
vol2, RAID group 0, stripe #2151004.
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Rewriting bad parity block on
volume vol2, RAID group 0, stripe #2151004.
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Inconsistent parity on volume
vol2, RAID group 0, stripe #2151005.
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Rewriting bad parity block on
volume vol2, RAID group 0, stripe #2151005.
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Inconsistent parity on volume
vol2, RAID group 0, stripe #2151006.
...
Sun Feb 14 07:20:07 CST [raid_stripe_owner]: Rewriting bad parity block on
volume vol2, RAID group 0, stripe #2156520.
Sun Feb 14 07:20:07 CST [raid_stripe_owner]: Inconsistent parity on volume
vol2, RAID group 0, stripe #2156521.
Sun Feb 14 07:20:07 CST [raid_stripe_owner]: Out of messages; cannot dump
contents of inconsistent stripe.
Sun Feb 14 07:20:24 CST [raid_stripe_owner]: Inconsistent parity on volume
vol2, RAID group 0, stripe #2156524.
Sun Feb 14 07:20:24 CST [raid_stripe_owner]: Out of messages; cannot dump
contents of inconsistent stripe.
Sun Feb 14 07:20:27 CST [raid_stripe_owner]: Inconsistent parity on volume
vol2, RAID group 0, stripe #2156528.
Sun Feb 14 07:20:27 CST [raid_stripe_owner]: Out of messages; cannot dump
contents of inconsistent stripe.
Sun Feb 14 13:25:56 GMT [sm_recover]: no address for host
<cdsd08.msp.sc.ti.com>
Sun Feb 14 13:25:59 GMT last message repeated 2 times
Sun Feb 14 13:26:01 GMT [rc]: de_main: e3a : Link up.
Sun Feb 14 13:26:03 GMT [sm_recover]: no address for host
<cdsd08.msp.sc.ti.com>
Sun Feb 14 13:26:06 GMT [rc]: de_main: e3b : Link up.
Sun Feb 14 13:26:11 GMT [sm_recover]: no address for host
<cdsd08.msp.sc.ti.com>
Sun Feb 14 13:26:12 GMT [rc]: de_main: e3c : Link up.
Sun Feb 14 13:26:17 GMT [rc]: de_main: e3d : Link up.
Sun Feb 14 07:26:18 CST [rc]: saving 1045M to /etc/crash/core.0
("../common/wafl/write_alloc.c:788: Assertion failure.")
Sun Feb 14 07:28:44 CST [rc]: relog syslog Sun Feb 14 07:21:22 CST
[raid_stripe_owner]: Rewriting bad parity block on volume vol2, RAID group
0, stripe #2156543.
Sun Feb 14 07:28:44 CST [rc]: NetApp Release 5.1.1 boot complete.  Last disk
update written at Sun Feb 14 07:25:53 CST 1999
Sun Feb 14 07:28:46 CST [asup_main]: System Notification mail sent
Sun Feb 14 08:00:01 CST [statd]:   8:00am up 35 mins, 2314 NFS ops, 0 CIFS
ops, 0 HTTP ops
...
Tue Feb 16 15:00:01 CST [statd]:   3:00pm up  2 days,  7:34 28486473 NFS
ops, 0 CIFS ops, 0 HTTP ops
Tue Feb 16 15:17:09 CST [isp2100_main]: Disk 0a.29(0xfffffc0000bde6d0):
opcode=0x2a sector 5666184 aborted command (b 47, 0)
Tue Feb 16 15:19:32 CST [isp2100_main]: Disk 0a.36(0xfffffc0000c42290):
opcode=0x2a sector 5682248 aborted command (b 47, 0)
Tue Feb 16 16:00:01 CST [statd]:   4:00pm up  2 days,  8:34 29062201 NFS
ops, 0 CIFS ops, 0 HTTP ops
Tue Feb 16 17:00:00 CST [statd]:   5:00pm up  2 days,  9:34 29525471 NFS
ops, 0 CIFS ops, 0 HTTP ops
Tue Feb 16 17:05:11 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c50610):
opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:11 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bd2d90):
opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c6c830):
opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bbff50):
opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bb20b0):
opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c00730):
opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c15fb0):
opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bd2090):
opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 18:00:01 CST [statd]:   6:00pm up  2 days, 10:34 29809864 NFS
ops, 0 CIFS ops, 0 HTTP ops
...
Thu Feb 18 13:00:01 CST [statd]:   1:00pm up  4 days,  5:33 48972544 NFS
ops, 0 CIFS ops, 0 HTTP ops
Thu Feb 18 13:43:50 CST [isp2100_main]: Disk 0a.41(0xfffffc0000c1b770):
opcode=0x28 sector 633677 recovered error (1 18, 2)
Thu Feb 18 14:00:01 CST [statd]:   2:00pm up  4 days,  6:33 50276094 NFS
ops, 0 CIFS ops, 0 HTTP ops
Thu Feb 18 14:02:47 CST [isp2100_main]: Disk 0a.41(0xfffffc0000c3eb50):
opcode=0x28 sector 673220 recovered error (1 9, 0)
Thu Feb 18 14:50:31 CST [isp2100_timeout]: 0a.14 (cmd opcode 0x2a, retry
count 0): command timeout, aborting request
Thu Feb 18 14:50:36 CST [isp2100_timeout]: isp2100_reset_device: 0a.14
(0.14) failed
Thu Feb 18 14:50:37 CST [isp2100_timeout]: Resetting ISP2100 0a (ha #0)
Thu Feb 18 15:00:00 CST [statd]:   3:00pm up  4 days,  7:33 51645917 NFS
ops, 0 CIFS ops, 0 HTTP ops
...
Fri Feb 19 15:00:01 CST [statd]:   3:00pm up  5 days,  7:33 56936830 NFS
ops, 0 CIFS ops, 0 HTTP ops
Fri Feb 19 15:25:30 CST [isp2100_main]: isp2100_error_proc: dev 0a.26 (0.26)
data overrun occured (cmd opcode 0x28) ( 0x0 0xc28 0x0 )
Fri Feb 19 15:25:30 CST [isp2100_main]: Disk 0a.26: Data overrun
(0xfffffc0000c3b0d0,0x28,0)
Fri Feb 19 15:25:32 CST [edm_admin]: No valid paths to Enclosure Services in
shelf 1 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in
shelf 0 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in
shelf 2 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in
shelf 3 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in
shelf 4 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in
shelf 5 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in
shelf 6 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in
shelf 7 on ha 0a.
...
Fri Feb 19 21:30:49 GMT [rc]: de_main: e3c : Link up.
Fri Feb 19 21:30:55 GMT [rc]: de_main: e3d : Link up.
Fri Feb 19 15:30:56 CST [rc]: saving 1045M to /etc/crash/core.1
("../common/wafl/dir.c:993: Assertion failure.")
Fri Feb 19 15:32:31 CST [tn_login_0]: Login from host: cdsd01
Fri Feb 19 15:33:13 CST [tn_login_0]: Login from host: melon10.msp.sc.ti.com
Fri Feb 19 15:33:22 CST [rc]: relog syslog Fri Feb 19 15:25:33 CST
[edm_admin]: No valid paths to Enclosure Services in shelf 7 on ha 0a.
Fri Feb 19 15:33:22 CST [rc]: NetApp Release 5.1.1 boot complete.  Last disk
update written at Fri Feb 19 15:30:30 CST 1999
Fri Feb 19 15:33:24 CST [asup_main]: System Notification mail sent
Fri Feb 19 16:00:01 CST [statd]:   4:00pm up 29 mins, 145079 NFS ops, 0 CIFS
ops, 0 HTTP ops
Fri Feb 19 16:01:36 CST [isp2100_main]: Disk 0a.41(0xfffffc0000bf8bb0):
opcode=0x2a sector 225337 recovered error (1 3, 0)
Fri Feb 19 17:00:01 CST [statd]:   5:00pm up  1:29 553443 NFS ops, 0 CIFS
ops, 0 HTTP ops
Fri Feb 19 18:00:01 CST [statd]:   6:00pm up  2:29 1762570 NFS ops, 0 CIFS
ops, 0 HTTP ops
Fri Feb 19 18:33:17 CST [isp2100_main]: Disk 0a.41(0xfffffc0000ba94f0):
opcode=0x2a sector 593920 recovered error (1 3, 0)
Fri Feb 19 19:00:01 CST [statd]:   7:00pm up  3:29 2474457 NFS ops, 0 CIFS
ops, 0 HTTP ops
...
Sat Feb 20 12:00:00 CST [statd]:  12:00pm up 20:29 8909278 NFS ops, 0 CIFS
ops, 0 HTTP ops
Sat Feb 20 18:31:47 GMT [sm_recover]: no address for host <cdsd08>
Sat Feb 20 18:31:47 GMT [sm_recover]: no address for host <cdsd07>
...
Sat Feb 20 18:32:02 GMT [rc]: de_main: e3c : Link up.
Sat Feb 20 18:32:08 GMT [rc]: de_main: e3d : Link up.
Sat Feb 20 12:32:09 CST [rc]: saving 1045M to /etc/crash/core.2
("wafl_check_vbns: vbn too big")
Sat Feb 20 12:34:37 CST [rc]: relog syslog Sat Feb 20 12:27:53 CST
[isp2100_main]: Disk 0a.16: Data overrun (0xfffffc0000c5d130,0x28,0)
Sat Feb 20 12:34:37 CST [rc]: NetApp Release 5.1.1 boot complete.  Last disk
update written at Sat Feb 20 12:31:42 CST 1999
Sat Feb 20 12:34:39 CST [asup_main]: System Notification mail sent
Sat Feb 20 12:55:27 CST [isp2100_main]: isp2100_error_proc: dev 0a.9 (0.9)
data overrun occured (cmd opcode 0x28) ( 0x0 0xc28 0x0 )
Sat Feb 20 12:55:27 CST [isp2100_main]: Disk 0a.9: Data overrun
(0xfffffc0000c54570,0x28,0)
Sat Feb 20 18:59:22 GMT [sm_recover]: no address for host <cdsd08>
Sat Feb 20 18:59:22 GMT [sm_recover]: no address for host <cdsd07>
...
Sat Feb 20 18:59:38 GMT [rc]: de_main: e3c : Link up.
Sat Feb 20 18:59:43 GMT [rc]: de_main: e3d : Link up.
Sat Feb 20 12:59:45 CST [rc]: saving 1045M to /etc/crash/core.3
("wafl_check_vbns: vbn too big")
Sat Feb 20 13:02:13 CST [rc]: relog syslog Sat Feb 20 12:55:27 CST
[isp2100_main]: Disk 0a.9: Data overrun (0xfffffc0000c54570,0x28,0)
Sat Feb 20 13:02:13 CST [rc]: NetApp Release 5.1.1 boot complete.  Last disk
update written at Sat Feb 20 12:59:19 CST 1999
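
For anyone wanting to see which disks show up most often in excerpts like the ones above, here is a minimal Python sketch, not a NetApp tool. It assumes the messages file has been saved as plain text with the same line wrapping as in this mail, and that disk IDs always look like "0a.9" or "8.11" after "Disk ", "dev ", or ": ". The function names are made up for illustration. It counts messages rather than distinct events, so a "dev ..." line and its matching "Disk ..." line count twice.

```python
import re
from collections import Counter

# A syslog entry starts with weekday, month, day and time,
# e.g. "Sun Feb 14 02:46:26 CST".
ENTRY_START = re.compile(
    r"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d "
)
# Assumed disk ID shape: adapter, dot, loop ID, e.g. "0a.9" or "8.11".
DISK_ID = re.compile(r"(?:Disk |dev |: )([0-9a-f]+\.\d+)\b")


def join_wrapped(lines):
    """Rejoin entries that mail wrapping split across several lines."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def count_disk_errors(lines):
    """Count isp2100 driver messages per disk ID."""
    counts = Counter()
    for entry in join_wrapped(lines):
        if "[isp2100_" not in entry:
            continue  # only look at FC driver messages
        match = DISK_ID.search(entry)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling count_disk_errors() on an open file object and printing most_common() on the result lists the disks by message count. Whether a high count points at a bad disk, a bad loop, or the adapter is still something for NetApp support to judge.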
Graham Knight wrote:
...
Last week we had a fiber channel F760 completely crash and burn. Right now
the system is in a state where it will serve data for 10 or 15 minutes and
then it reports inconsistent parity errors and crashes. We will be sending it
to NetApp for failure analysis.
That was a nightmare to say the least - and, of course, it happened while I
was on vacation. (My first thought was that somebody was mad at me for flaming
about the disk firmware thing and decided to take it out on me personally -
yikes - I swear off flaming for the rest of my life). Anyway, last night
another of our filers crashed with the same type of errors. This is an F630
with FC disks. I called NetApp - just got off the phone with them - they are
looking into it, but before I lose another filer I thought I'd bounce this one
off of the user community.
These are the types of errors it spits out:
Sun Feb 21 04:22:00 MST [isp2100_main]: isp2100_error_proc: dev 8.11
(0.11) data underrun (cmd opcode 0x28) ( 0x0 0x5c28 0x8000 )
Sun Feb 21 04:22:00 MST [isp2100_main]: Disk 8.11: Data underrun
(0xfffffc0000c204f0,0x28,0)
a few of those then a few of these:
Sun Feb 21 04:22:00 MST [raid_stripe_owner]: Inconsistent parity on
volume vol0, RAID group 2, stripe #377824.
Sun Feb 21 04:22:00 MST [raid_stripe_owner]: Rewriting bad parity block
on volume vol0, RAID group 2, stripe #377824.
Then:
Sun Feb 21 04:23:22 MST [edm_admin]: No valid paths to Enclosure
Services in shelf 2 on ha 8.
Sun Feb 21 04:23:22 MST [raid_stripe_owner]: Inconsistent parity on
volume vol0, RAID group 2, stripe #377861.
Sun Feb 21 04:23:22 MST [raid_stripe_owner]: Out of messages; cannot
dump contents of inconsistent stripe.
Sun Feb 21 04:24:00 MST [edm_admin]: No valid paths to Enclosure
Services in shelf 3 on ha 8.
Sun Feb 21 11:30:36 GMT [rc]: de_main: e10 : Link up.
Sun Feb 21 04:30:37 MST [rc]: saving 44M to /etc/crash/core.0.nz
("wafl_check_vbns: vbn too big")
NetApp thinks the previous failure may have been due to heat in the computer
room - BUT - the filer *never* spit out any thermal warnings. I'm not too sure
I believe that heat was the culprit - I can run it for hours doing disk
scrubs, but as soon as I use it for NIS it crashes. This system is in a
different room and has a better ventilated cabinet - again, no thermal
warnings in the messages file - but maybe the thermal warning messages don't
work? Just out of curiosity - has anyone ever seen a FC system spit out a
thermal warning?? (5.1.2P2)
Has anyone experienced these types of problems with FC filers? I've never had
a problem with SCSI filers - but these FC ones just seem flaky to me....
Thanks for any insight anyone can lend...
Graham
-- 
---------------------------------------------------------------
G D Geen                        mailto:geen@ti.com 
Texas Instruments               Phone : (972)480.7896
System Administrator            FAX   : (972)480.7676
---------------------------------------------------------------
Life is what happens while you're busy making other plans.
                                              -J. Lennon
