In the spirit of posting things we've tried to share with the list in the past, here's some information about a recent situation where our NetApp was crashing over and over, resulting in four or so consecutive sleepless nights for me. The situation was resolved in the short term by upgrading Data ONTAP from 7.1.1 (built 6/2006) to 7.3.3. Sadly, it was also the impetus to speed up migrating the data to our new EMC-based infrastructure (not my decision).
Posting here just in case it's helpful to anyone.
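[For anyone who hasn't done one of these in a while, the upgrade itself was the usual 7-mode procedure -- roughly the following, from memory; the package filename is just a placeholder, and the exact behavior may differ a bit between releases:]

    files> software update 733_setup_q.exe    # package name is a placeholder; file goes in /etc/software
    files> download                           # install the new boot image (software update may do this for you)
    files> version -b                         # sanity-check what's actually on the boot device
    files> reboot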
Interestingly, the errors seemed to continue after the upgrade; something dumped core, but the system didn't crash. Note also that the debugging information is more specific about which volume the problem is on than the log messages from before the upgrade:
Wed Jul 27 19:34:51 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume acs, Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block 3765221, inode number -45, snapid 0, file block 690030, level 1.
Wed Jul 27 19:34:51 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221
Wed Jul 27 19:34:51 PDT [wafl.inconsistent.indirect:error]: Bad indirect block in vol 'acs', snap 13, fbn 690030, level 1, fileid -1.
Wed Jul 27 19:34:51 PDT [wafl.inconsistent.vol:error]: WAFL inconsistent: volume acs is inconsistent. Note: Any new Snapshot copies might contain this inconsistency.
Wed Jul 27 19:34:51 PDT [wafl.raid.incons.buf:error]: WAFL inconsistent: bad block 143002981 (vvbn:97941303 fbn:690030 level:1) in inode (fileid: 4294967251 snapid:0 file_type:1 disk_flags:0x2) in volume acs.
Wed Jul 27 19:34:51 PDT [coredump.micro.completed:info]: Microcore (/etc/crash/micro-core.101186401.2011-07-28.02_34_51) generation completed
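[For anyone following along at home, you can confirm from the console which volumes are flagged inconsistent -- if I'm remembering the output right, 'vol status' shows a 'wafl inconsistent' flag on an affected volume:]

    files> vol status acs    # status should include 'wafl inconsistent' if the volume is flagged
    files> vol status        # or check every volume at once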
Sorry about the unfinished sentence at the beginning. I was running on very little sleep at the time.
----- Forwarded message from William Yardley <toasters@veggiechinese.net> -----

From: William Yardley <toasters@veggiechinese.net>
To: toasters@mathworks.com
Subject: crashes due to corrupt volume?
I have a FAS3020 which has been crashing repeatedly. It has a single aggregate (aggr0) with
Running WAFL_check fixed a few errors, and subsequent runs came back clean; wafliron from the maintenance menu also came back clean. However, we kept seeing errors like the ones below, even after replacing the disks that were supposedly having problems (leading me to believe something on the volume is corrupt rather than it being a physical problem). We did have a disk failure, and that disk has already been rebuilt onto a spare.
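[For reference, this is roughly how wafliron gets kicked off from the normal console -- it can also be run from maintenance mode, which is what we did; the aggregate name is ours:]

    files> priv set advanced
    files*> aggr wafliron start aggr0
    files*> aggr wafliron status aggr0    # poll until it finishes and see what, if anything, it fixed
    files*> priv set admin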
The problem looks like this:
Mon Jul 25 02:19:44 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume aggr0, Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block 3765221, inode number -45, snapid 0, file block 690030, level 1.
Mon Jul 25 02:40:10 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block #3765221
Mon Jul 25 02:40:10 PDT [raid.rg.readerr.repair.data:notice]: Fixing bad data on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block #3765221
[And later -- note a few different serial numbers after we replaced the two drives the errors were showing up on; we did this by doing a 'disk fail -i' on each drive and then letting it reconstruct, one drive at a time.]
Mon Jul 25 21:44:45 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume aggr0, Disk /aggr0/plex0/rg1/0a.40 Shelf 2 Bay 8 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FRM000007451DZZ2], block 3765221, inode number -45, snapid 0, file block 690030, level 1.
Tue Jul 26 04:15:29 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0c.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221
Tue Jul 26 04:15:29 PDT [raid.rg.readerr.repair.data:notice]: Fixing bad data on Disk /aggr0/plex0/rg1/0c.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221
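[For reference, the replacement procedure was roughly the following, one drive at a time; the disk name here is just one of ours:]

    files> disk fail -i 0a.41    # fail the suspect disk immediately, without pre-copying it to a spare
    files> aggr status -r        # watch reconstruction onto a spare before touching the next disk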
also a bunch of these:
Tue Jul 26 04:16:30 PDT [wafl.dir.inconsistent:error]: wafl_nfs_lookup: inconsistent directory entry {x20 13 14495183 105181080 -520015383} <C7DA730C028D> in {x20 13 2781660 870430057 -520015383}. WAFL may be inconsistent. Call NetApp Support.
Tue Jul 26 04:16:30 PDT [wafl.dir.inconsistent:error]: wafl_nfs_lookup: inconsistent directory entry {x20 13 14495184 105181084 -520015383} <C7DDF30B0808> in {x20 13 2781660 870430057 -520015383}. WAFL may be inconsistent. Call NetApp Support.
The parity inconsistency files in /etc/crash/ show something like:
Checksum mismatch on disk /aggr0/plex0/rg1/0a.41 (S/N 3HY8FV2S00007418EU6V), block #3765221 (sector #0x1cb9f28).
Checksum mismatch found on Mon Jul 25 02:30:00 PDT 2011
Expected block: [...]
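[Since these are raw checksum mismatches, one thing that can be run online is a parity scrub of the aggregate; it re-verifies parity and checksums, though it won't fix WAFL-level metadata. Roughly:]

    files> aggr scrub start aggr0
    files> aggr scrub status     # progress, plus whether anything was found and repaired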
The system would often crash, and then crash again on reboot -- booting without /etc/rc and *then* enabling services seemed to help, though the system would usually crash again within hours.
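[In case anyone is wondering what 'enabling services' meant in practice: with /etc/rc out of the way, it was roughly just bringing up an interface and the protocols by hand. The interface name and addresses below are placeholders, not our real ones:]

    files> ifconfig e0a 192.0.2.10 netmask 255.255.255.0 up    # placeholder interface name and IP
    files> exportfs -a                                         # re-export everything in /etc/exports
    files> nfs on                                              # plus whatever other services you actually run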
After looking at a (very) old post on this list, I tried disabling quotas on the active volumes; this may have helped, but I'm nervous that the system will crash again. Short of restoring to a snapshot from before the problem started, is there anything else (online or offline) I can do to fix whatever problems might be on the volume, or to manually remove the problematic block(s)?
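[The quota change was just the obvious per-volume command; the snapshot restore I'd like to avoid would be SnapRestore of the whole affected volume. The snapshot name below is a placeholder:]

    files> quota off acs                          # repeated for each active volume with quotas
    files> snap list acs                          # find a snapshot from before the corruption started
    files> snap restore -t vol -s nightly.0 acs   # needs the snaprestore license; reverts the entire volume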
Some other info, in case it helps:

files*> aggr status -v
           Aggr State      Status          Options
          aggr0 online     raid_dp, aggr   root, diskroot, nosnap=off,
                                           raidtype=raid_dp, raidsize=16,
                                           snapmirrored=off, resyncsnaptime=60,
                                           fs_size_fixed=off,
                                           snapshot_autodelete=on,
                                           lost_write_protect=on
Volumes: m_hosts, luns, acs, root, tmp, software, home, legacy_mail, x_tmt, backups, dropbox, netflow2, xen_backup
Plex /aggr0/plex0: online, normal, active
RAID group /aggr0/plex0/rg0: normal
RAID group /aggr0/plex0/rg1: normal
files*> aggr status -r
Aggregate aggr0 (online, raid_dp) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (normal)
RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
--------- ------  ------------- ---- ---- ---- ----- --------------    --------------
dparity   0a.22   0a    1   6   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
parity    0a.32   0a    2   0   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.23   0a    1   7   FC:A   -  FCAL 10000 136000/278528000  139072/284820800
data      0a.39   0a    2   7   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.35   0c    2   3   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.26   0c    1   10  FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.45   0a    2   13  FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.18   0c    1   2   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.28   0c    1   12  FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.37   0c    2   5   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.24   0a    1   8   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.43   0a    2   11  FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.29   0c    1   13  FC:B   -  FCAL 10000 136000/278528000  137485/281570072
RAID group /aggr0/plex0/rg1 (normal)
RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
--------- ------  ------------- ---- ---- ---- ----- --------------    --------------
dparity   0a.17   0a    1   1   FC:A   -  FCAL 10000 136000/278528000  137422/281442144
parity    0a.33   0a    2   1   FC:A   -  FCAL 10000 136000/278528000  139072/284820800
data      0c.41   0c    2   9   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.38   0a    2   6   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.36   0a    2   4   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.27   0c    1   11  FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.21   0a    1   5   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.34   0a    2   2   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.19   0c    1   3   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.20   0a    1   4   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.44   0c    2   12  FC:B   -  FCAL 10000 136000/278528000  137422/281442144
data      0c.16   0c    1   0   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.42   0a    2   10  FC:A   -  FCAL 10000 136000/278528000  137104/280790184
Spare disks
RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
--------- ------  ------------- ---- ---- ---- ----- --------------    --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare     0c.25   0c    1   9   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
spare     0c.40   0c    2   8   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
----- End forwarded message -----