Just to expand on what Steve Gremban has given you: we have seen the same errors. Network Appliance has not given me any indication that they think the problem is heat related. As a matter of fact, I got a bug report number, 1549, and was asked to run "wackz" on the filer to fix the parity inconsistencies. From the beginning, I was told that this was a hardware problem and that the fix was to run "wackz". We had two of the core files analyzed by the NetApp system engineers to confirm their findings.
I have not yet talked to the system engineer who performed the analysis, but I did talk with the support line people who were able to read the resolution status to me over the phone. I will ask that the status be e-mailed to me on Monday morning. Once I receive the report, I will forward the summarized information on to you.
Bug ID 1549
Title
System crashes with the "write_alloc.c:xxx: Assertion failure" message.
Problem Description
When the filer removes a file, it tries to resize the filer's inode and free a block associated with the inode. However, when the filer is verifying whether the block to be freed is part of the active file system, if the block is already free, the filer crashes.
Workaround
Run the wack utility on the filer. For more information about wack, contact Network Appliance Technical Support.
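To make the failure mode in the problem description concrete, here is a toy sketch of "freeing a block that is already free trips an assertion". It is not Data ONTAP or WAFL code; the class name `ToyBlockMap` and its methods are made up purely for illustration, and the real check in write_alloc.c is certainly more involved.

```python
class ToyBlockMap:
    """Toy allocation map (hypothetical; NOT NetApp code)."""

    def __init__(self, nblocks):
        # False means the block is free, True means it is in use.
        self.in_use = [False] * nblocks

    def allocate(self, bno):
        self.in_use[bno] = True

    def free(self, bno):
        # Mirrors the bug report in spirit: the code assumes the block
        # being freed is still in use, and fails hard if it is not.
        if not self.in_use[bno]:
            raise AssertionError(f"block {bno} is already free")
        self.in_use[bno] = False


blocks = ToyBlockMap(8)
blocks.allocate(3)
blocks.free(3)          # fine: block 3 was in use
try:
    blocks.free(3)      # second free of the same block
except AssertionError as err:
    second_free_error = str(err)
```

In the toy version the second free only raises an exception; on the filer, per the bug report, the equivalent condition brings the system down.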
The thermometer/humidity gauge was in the cabinet next to the filer that reported the problem. Since it was placed in this cabinet, the min. temp was 85 deg. F and max was 91 deg. F. The humidity ranged from 26 to 56%. I have moved the thermometer into the cabinet housing the filer we are currently having problems with and have reset the min/max.
If you are still concerned about temperature, get a temp/humidity gauge from Radio Shack; they are about $20.
I strongly suggest that, if you have not done so already, you run "wackz" on your filer as soon as possible and contact your NetApp support.
-gdg
Graham,
We just started seeing the same problems with one of our F760s after 45 days of uptime. It first happened last week, on 2/14, during a scrub. The filer rebooted and recovered OK. We opened a call (#52749) with Netapp. It happened again on 2/19 and twice on 2/20. On 2/21 the filer booted but was not able to recover. Netapp had us run "wackz" to recover.
There are no thermal warnings in the messages file. Tomorrow we will put a thermometer in the cabinet to check it out.
-Steve gremban@ti.com
Here are some excerpts from the system notification message:
===== SYSCONFIG-V =====
NetApp Release 5.1.1: Sat Aug 1 10:22:36 PDT 1998
System ID: 0016786020 (regina)
slot 0: System Board (NetApp System Board V H0)
Model Name: F760
Serial Number: 300258
Firmware release: 2.0_a2
Memory Size: 1024 MB
slot 0: FC adapter: isp2100 (chip rev. 3)
Firmware rev: 1.12
Host Loop Id: 119
Cacheline size: 8
FC Packet size: 2048
0: SEAGATE ST19171FC 0017 Size=8.6GB (17783112 blocks)
1: SEAGATE ST19171FC 0017 Size=8.6GB (17783112 blocks)
...
slot 8: FC adapter: isp2100 (chip rev. 3)
Firmware rev: 1.12
Host Loop Id: 119
Cacheline size: 8
FC Packet size: 2048
0: SEAGATE ST19171FC 0017 Size=8.6GB (17783112 blocks)
1: SEAGATE ST19171FC 0017 Size=8.6GB (17783112 blocks)
...
===== MESSAGES =====
Sun Feb 14 00:00:09 CST [asup_main]: System Notification mail sent
Sun Feb 14 01:00:01 CST [statd]: 1:00am up 45 days, 17:14 236884596 NFS ops, 0 CIFS ops, 0 HTTP ops
Sun Feb 14 01:00:01 CST [raid_scrub_admin]: Beginning disk scrubbing...
Sun Feb 14 02:00:01 CST [statd]: 2:00am up 45 days, 18:14 236888295 NFS ops, 0 CIFS ops, 0 HTTP ops
Sun Feb 14 02:46:26 CST [isp2100_timeout]: 0a.4 (cmd opcode 0x28, retry count 0): command timeout, aborting request
Sun Feb 14 02:46:31 CST [isp2100_timeout]: isp2100_reset_device: 0a.4 (0.4) failed
Sun Feb 14 02:46:31 CST [isp2100_timeout]: Resetting ISP2100 0a (ha #0)
Sun Feb 14 03:00:01 CST [statd]: 3:00am up 45 days, 19:14 236889166 NFS ops, 0 CIFS ops, 0 HTTP ops
Sun Feb 14 03:25:13 CST [isp2100_main]: isp2100_error_proc: dev 0a.9 (0.9) data underrun (cmd opcode 0x28) ( 0x0 0x5c28 0x0 )
Sun Feb 14 03:25:13 CST [isp2100_main]: Disk 0a.9: Data underrun (0xfffffc0000c2a7f0,0x28,0)
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #2151004.
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Rewriting bad parity block on volume vol2, RAID group 0, stripe #2151004.
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #2151005.
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Rewriting bad parity block on volume vol2, RAID group 0, stripe #2151005.
Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #2151006.
...
Sun Feb 14 07:20:07 CST [raid_stripe_owner]: Rewriting bad parity block on volume vol2, RAID group 0, stripe #2156520.
Sun Feb 14 07:20:07 CST [raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #2156521.
Sun Feb 14 07:20:07 CST [raid_stripe_owner]: Out of messages; cannot dump contents of inconsistent stripe.
Sun Feb 14 07:20:24 CST [raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #2156524.
Sun Feb 14 07:20:24 CST [raid_stripe_owner]: Out of messages; cannot dump contents of inconsistent stripe.
Sun Feb 14 07:20:27 CST [raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #2156528.
Sun Feb 14 07:20:27 CST [raid_stripe_owner]: Out of messages; cannot dump contents of inconsistent stripe.
Sun Feb 14 13:25:56 GMT [sm_recover]: no address for host <cdsd08.msp.sc.ti.com>
Sun Feb 14 13:25:59 GMT last message repeated 2 times
Sun Feb 14 13:26:01 GMT [rc]: de_main: e3a : Link up.
Sun Feb 14 13:26:03 GMT [sm_recover]: no address for host <cdsd08.msp.sc.ti.com>
Sun Feb 14 13:26:06 GMT [rc]: de_main: e3b : Link up.
Sun Feb 14 13:26:11 GMT [sm_recover]: no address for host <cdsd08.msp.sc.ti.com>
Sun Feb 14 13:26:12 GMT [rc]: de_main: e3c : Link up.
Sun Feb 14 13:26:17 GMT [rc]: de_main: e3d : Link up.
Sun Feb 14 07:26:18 CST [rc]: saving 1045M to /etc/crash/core.0 ("../common/wafl/write_alloc.c:788: Assertion failure.")
Sun Feb 14 07:28:44 CST [rc]: relog syslog Sun Feb 14 07:21:22 CST [raid_stripe_owner]: Rewriting bad parity block on volume vol2, RAID group 0, stripe #2156543.
Sun Feb 14 07:28:44 CST [rc]: NetApp Release 5.1.1 boot complete. Last disk update written at Sun Feb 14 07:25:53 CST 1999
Sun Feb 14 07:28:46 CST [asup_main]: System Notification mail sent
Sun Feb 14 08:00:01 CST [statd]: 8:00am up 35 mins, 2314 NFS ops, 0 CIFS ops, 0 HTTP ops
...
Tue Feb 16 15:00:01 CST [statd]: 3:00pm up 2 days, 7:34 28486473 NFS ops, 0 CIFS ops, 0 HTTP ops
Tue Feb 16 15:17:09 CST [isp2100_main]: Disk 0a.29(0xfffffc0000bde6d0): opcode=0x2a sector 5666184 aborted command (b 47, 0)
Tue Feb 16 15:19:32 CST [isp2100_main]: Disk 0a.36(0xfffffc0000c42290): opcode=0x2a sector 5682248 aborted command (b 47, 0)
Tue Feb 16 16:00:01 CST [statd]: 4:00pm up 2 days, 8:34 29062201 NFS ops, 0 CIFS ops, 0 HTTP ops
Tue Feb 16 17:00:00 CST [statd]: 5:00pm up 2 days, 9:34 29525471 NFS ops, 0 CIFS ops, 0 HTTP ops
Tue Feb 16 17:05:11 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c50610): opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:11 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bd2d90): opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c6c830): opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bbff50): opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bb20b0): opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c00730): opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c15fb0): opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 17:05:55 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bd2090): opcode=0x28 sector 0 not ready (2 4, 1)
Tue Feb 16 18:00:01 CST [statd]: 6:00pm up 2 days, 10:34 29809864 NFS ops, 0 CIFS ops, 0 HTTP ops
...
Thu Feb 18 13:00:01 CST [statd]: 1:00pm up 4 days, 5:33 48972544 NFS ops, 0 CIFS ops, 0 HTTP ops
Thu Feb 18 13:43:50 CST [isp2100_main]: Disk 0a.41(0xfffffc0000c1b770): opcode=0x28 sector 633677 recovered error (1 18, 2)
Thu Feb 18 14:00:01 CST [statd]: 2:00pm up 4 days, 6:33 50276094 NFS ops, 0 CIFS ops, 0 HTTP ops
Thu Feb 18 14:02:47 CST [isp2100_main]: Disk 0a.41(0xfffffc0000c3eb50): opcode=0x28 sector 673220 recovered error (1 9, 0)
Thu Feb 18 14:50:31 CST [isp2100_timeout]: 0a.14 (cmd opcode 0x2a, retry count 0): command timeout, aborting request
Thu Feb 18 14:50:36 CST [isp2100_timeout]: isp2100_reset_device: 0a.14 (0.14) failed
Thu Feb 18 14:50:37 CST [isp2100_timeout]: Resetting ISP2100 0a (ha #0)
Thu Feb 18 15:00:00 CST [statd]: 3:00pm up 4 days, 7:33 51645917 NFS ops, 0 CIFS ops, 0 HTTP ops
...
Fri Feb 19 15:00:01 CST [statd]: 3:00pm up 5 days, 7:33 56936830 NFS ops, 0 CIFS ops, 0 HTTP ops
Fri Feb 19 15:25:30 CST [isp2100_main]: isp2100_error_proc: dev 0a.26 (0.26) data overrun occured (cmd opcode 0x28) ( 0x0 0xc28 0x0 )
Fri Feb 19 15:25:30 CST [isp2100_main]: Disk 0a.26: Data overrun (0xfffffc0000c3b0d0,0x28,0)
Fri Feb 19 15:25:32 CST [edm_admin]: No valid paths to Enclosure Services in shelf 1 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in shelf 0 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in shelf 2 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in shelf 3 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in shelf 4 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in shelf 5 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in shelf 6 on ha 0a.
Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in shelf 7 on ha 0a.
...
Fri Feb 19 21:30:49 GMT [rc]: de_main: e3c : Link up.
Fri Feb 19 21:30:55 GMT [rc]: de_main: e3d : Link up.
Fri Feb 19 15:30:56 CST [rc]: saving 1045M to /etc/crash/core.1 ("../common/wafl/dir.c:993: Assertion failure.")
Fri Feb 19 15:32:31 CST [tn_login_0]: Login from host: cdsd01
Fri Feb 19 15:33:13 CST [tn_login_0]: Login from host: melon10.msp.sc.ti.com
Fri Feb 19 15:33:22 CST [rc]: relog syslog Fri Feb 19 15:25:33 CST [edm_admin]: No valid paths to Enclosure Services in shelf 7 on ha 0a.
Fri Feb 19 15:33:22 CST [rc]: NetApp Release 5.1.1 boot complete. Last disk update written at Fri Feb 19 15:30:30 CST 1999
Fri Feb 19 15:33:24 CST [asup_main]: System Notification mail sent
Fri Feb 19 16:00:01 CST [statd]: 4:00pm up 29 mins, 145079 NFS ops, 0 CIFS ops, 0 HTTP ops
Fri Feb 19 16:01:36 CST [isp2100_main]: Disk 0a.41(0xfffffc0000bf8bb0): opcode=0x2a sector 225337 recovered error (1 3, 0)
Fri Feb 19 17:00:01 CST [statd]: 5:00pm up 1:29 553443 NFS ops, 0 CIFS ops, 0 HTTP ops
Fri Feb 19 18:00:01 CST [statd]: 6:00pm up 2:29 1762570 NFS ops, 0 CIFS ops, 0 HTTP ops
Fri Feb 19 18:33:17 CST [isp2100_main]: Disk 0a.41(0xfffffc0000ba94f0): opcode=0x2a sector 593920 recovered error (1 3, 0)
Fri Feb 19 19:00:01 CST [statd]: 7:00pm up 3:29 2474457 NFS ops, 0 CIFS ops, 0 HTTP ops
...
Sat Feb 20 12:00:00 CST [statd]: 12:00pm up 20:29 8909278 NFS ops, 0 CIFS ops, 0 HTTP ops
Sat Feb 20 18:31:47 GMT [sm_recover]: no address for host <cdsd08>
Sat Feb 20 18:31:47 GMT [sm_recover]: no address for host <cdsd07>
...
Sat Feb 20 18:32:02 GMT [rc]: de_main: e3c : Link up.
Sat Feb 20 18:32:08 GMT [rc]: de_main: e3d : Link up.
Sat Feb 20 12:32:09 CST [rc]: saving 1045M to /etc/crash/core.2 ("wafl_check_vbns: vbn too big")
Sat Feb 20 12:34:37 CST [rc]: relog syslog Sat Feb 20 12:27:53 CST [isp2100_main]: Disk 0a.16: Data overrun (0xfffffc0000c5d130,0x28,0)
Sat Feb 20 12:34:37 CST [rc]: NetApp Release 5.1.1 boot complete. Last disk update written at Sat Feb 20 12:31:42 CST 1999
Sat Feb 20 12:34:39 CST [asup_main]: System Notification mail sent
Sat Feb 20 12:55:27 CST [isp2100_main]: isp2100_error_proc: dev 0a.9 (0.9) data overrun occured (cmd opcode 0x28) ( 0x0 0xc28 0x0 )
Sat Feb 20 12:55:27 CST [isp2100_main]: Disk 0a.9: Data overrun (0xfffffc0000c54570,0x28,0)
Sat Feb 20 18:59:22 GMT [sm_recover]: no address for host <cdsd08>
Sat Feb 20 18:59:22 GMT [sm_recover]: no address for host <cdsd07>
...
Sat Feb 20 18:59:38 GMT [rc]: de_main: e3c : Link up.
Sat Feb 20 18:59:43 GMT [rc]: de_main: e3d : Link up.
Sat Feb 20 12:59:45 CST [rc]: saving 1045M to /etc/crash/core.3 ("wafl_check_vbns: vbn too big")
Sat Feb 20 13:02:13 CST [rc]: relog syslog Sat Feb 20 12:55:27 CST [isp2100_main]: Disk 0a.9: Data overrun (0xfffffc0000c54570,0x28,0)
Sat Feb 20 13:02:13 CST [rc]: NetApp Release 5.1.1 boot complete. Last disk update written at Sat Feb 20 12:59:19 CST 1999
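With this much output it can help to tally the errors per disk and per RAID group rather than eyeball them. The following is only a rough sketch, assuming the messages file has been copied off the filer as plain text with one entry per line; the function name `summarize` and the `SAMPLE` list are made up for illustration, and the two patterns cover only the "Disk ..." and "Inconsistent parity ..." line shapes that appear in the excerpts above.

```python
import re
from collections import Counter

# "[isp2100_main]: Disk 0a.9: Data underrun (...)" and
# "[isp2100_main]: Disk 0a.11(0x...): opcode=0x28 ... not ready (...)"
DISK_RE = re.compile(r"\[isp2100_main\]: Disk (\S+?)[:(]")
# "[raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #N."
PARITY_RE = re.compile(
    r"Inconsistent parity on volume (\S+), RAID group (\d+), stripe #(\d+)"
)


def summarize(lines):
    """Count disk error lines per disk and parity errors per (volume, group)."""
    disk_errors = Counter()
    parity_errors = Counter()
    for line in lines:
        m = DISK_RE.search(line)
        if m:
            disk_errors[m.group(1)] += 1
        m = PARITY_RE.search(line)
        if m:
            parity_errors[(m.group(1), int(m.group(2)))] += 1
    return disk_errors, parity_errors


# A few lines copied from the excerpts above.
SAMPLE = [
    "Sun Feb 14 03:25:13 CST [isp2100_main]: Disk 0a.9: Data underrun (0xfffffc0000c2a7f0,0x28,0)",
    "Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #2151004.",
    "Sun Feb 14 03:25:13 CST [raid_stripe_owner]: Inconsistent parity on volume vol2, RAID group 0, stripe #2151005.",
    "Tue Feb 16 17:05:11 CST [isp2100_main]: Disk 0a.11(0xfffffc0000c50610): opcode=0x28 sector 0 not ready (2 4, 1)",
    "Tue Feb 16 17:05:11 CST [isp2100_main]: Disk 0a.11(0xfffffc0000bd2d90): opcode=0x28 sector 0 not ready (2 4, 1)",
]

disk_errors, parity_errors = summarize(SAMPLE)
for disk, count in disk_errors.most_common():
    print(disk, count)
```

The "isp2100_error_proc: dev ..." lines are deliberately not counted, since each one is followed by a "Disk ..." line for the same event.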
Graham Knight wrote:
Last week we had a fiber channel F760 completely crash and burn. Right now the system is in a state where it will serve data for 10 or 15 minutes and then it reports inconsistent parity errors and crashes. We will be sending it to Netapp for failure analysis.
That was a nightmare to say the least - and, of course, it happened while i was on vacation. (My first thought was that somebody was mad at me for flaming about the disk firmware thing and decided to take it out on me personally - yikes - i swear off flaming for the rest of my life). Anyway, last night another of our filers crashed with the same type of errors. This is an F630 with FC disks. I called NetApp - just got off the phone with them - they are looking into it, but before i lose another filer i thought i'd bounce this one off of the user community.
These are the types of errors it spits out:
Sun Feb 21 04:22:00 MST [isp2100_main]: isp2100_error_proc: dev 8.11 (0.11) data underrun (cmd opcode 0x28) ( 0x0 0x5c28 0x8000 )
Sun Feb 21 04:22:00 MST [isp2100_main]: Disk 8.11: Data underrun (0xfffffc0000c204f0,0x28,0)
a few of those then a few of these:
Sun Feb 21 04:22:00 MST [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 2, stripe #377824.
Sun Feb 21 04:22:00 MST [raid_stripe_owner]: Rewriting bad parity block on volume vol0, RAID group 2, stripe #377824.
Then:
Sun Feb 21 04:23:22 MST [edm_admin]: No valid paths to Enclosure Services in shelf 2 on ha 8.
Sun Feb 21 04:23:22 MST [raid_stripe_owner]: Inconsistent parity on volume vol0, RAID group 2, stripe #377861.
Sun Feb 21 04:23:22 MST [raid_stripe_owner]: Out of messages; cannot dump contents of inconsistent stripe.
Sun Feb 21 04:24:00 MST [edm_admin]: No valid paths to Enclosure Services in shelf 3 on ha 8.
Sun Feb 21 11:30:36 GMT [rc]: de_main: e10 : Link up.
Sun Feb 21 04:30:37 MST [rc]: saving 44M to /etc/crash/core.0.nz ("wafl_check_vbns: vbn too big")
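For anyone unsure what "Inconsistent parity" means in these messages, here is a minimal sketch of the idea, assuming a single XOR parity block per stripe (RAID 4 style). The logs above do not describe the on-disk layout, so treat that as an assumption; the function names `xor_blocks` and `stripe_is_consistent` are invented for this example and are not filer commands or APIs.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(result):
            raise ValueError("all blocks in a stripe must be the same size")
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def stripe_is_consistent(data_blocks, parity_block):
    """A stripe is consistent when parity equals the XOR of its data blocks."""
    return xor_blocks(data_blocks) == parity_block


# Tiny two-byte "blocks" standing in for real disk blocks.
data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
good_parity = xor_blocks(data)

print(stripe_is_consistent(data, good_parity))      # parity matches the data
print(stripe_is_consistent(data, b"\x00\x00"))      # parity does not match
```

Under that assumption, a mismatch only says that the data and parity disagree; the check cannot tell which block is wrong, so recomputing parity from whatever the data disks returned will not repair a bad data read.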
Netapp thinks the previous failure may have been due to heat in the computer room - BUT - the filer *never* spit out any thermal warnings - I'm not too sure if i believe that heat was the culprit - i can run it for hours doing disk scrubs, but as soon as i use it for NIS it crashes. This system is in a different room and has a better-ventilated cabinet - again, no thermal warnings in the messages file - but maybe the thermal warning messages don't work? Just out of curiosity - has anyone ever seen an FC system spit out a thermal warning?? (5.1.2P2)
Has anyone experienced these types of problem with FC filers? I've never had a problem with SCSI filers - but these FC ones just seem flaky to me....
Thanks for any insight anyone can lend...
Graham