In the spirit of posting things we've tried to share with the list in the past, here's some information about a recent situation where our NetApp was crashing over and over, resulting in four or so consecutive sleepless nights for me. The situation was resolved in the short term by upgrading Data ONTAP from 7.1.1 (built 6/2006) to 7.3.3. Sadly, it was also the impetus to speed up migrating the data to our new EMC-based infrastructure (not my decision).
Posting here just in case it's helpful to anyone.
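[For anyone who hasn't done one of these in a while, the upgrade itself was the usual 7-mode procedure -- roughly the following, from memory; the package filename is just a placeholder, and the exact behavior may differ a bit between releases:]

    files> software update 733_setup_q.exe    # package name is a placeholder; file goes in /etc/software
    files> download                           # install the new boot image (software update may do this for you)
    files> version -b                         # sanity-check what's actually on the boot device
    files> reboot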
Interestingly, the errors seemed to continue after the upgrade; something dumped core, but the system didn't crash. Note also that the debugging information is more specific about which volume the problem is on than the log messages from before the upgrade:
Wed Jul 27 19:34:51 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume acs, Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block 3765221, inode number -45, snapid 0, file block 690030, level 1.
Wed Jul 27 19:34:51 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221
Wed Jul 27 19:34:51 PDT [wafl.inconsistent.indirect:error]: Bad indirect block in vol 'acs', snap 13, fbn 690030, level 1, fileid -1.
Wed Jul 27 19:34:51 PDT [wafl.inconsistent.vol:error]: WAFL inconsistent: volume acs is inconsistent. Note: Any new Snapshot copies might contain this inconsistency.
Wed Jul 27 19:34:51 PDT [wafl.raid.incons.buf:error]: WAFL inconsistent: bad block 143002981 (vvbn:97941303 fbn:690030 level:1) in inode (fileid: 4294967251 snapid:0 file_type:1 disk_flags:0x2) in volume acs.
Wed Jul 27 19:34:51 PDT [coredump.micro.completed:info]: Microcore (/etc/crash/micro-core.101186401.2011-07-28.02_34_51) generation completed
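[For anyone following along at home, you can confirm from the console which volumes are flagged inconsistent -- if I'm remembering the output right, 'vol status' shows a 'wafl inconsistent' flag on an affected volume:]

    files> vol status acs    # status should include 'wafl inconsistent' if the volume is flagged
    files> vol status        # or check every volume at once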
Sorry about the unfinished sentence at the beginning. I was running on very little sleep at the time.
----- Forwarded message from William Yardley <toasters@veggiechinese.net> -----

From: William Yardley <toasters@veggiechinese.net>
To: toasters@mathworks.com
Subject: crashes due to corrupt volume?
I have a FAS3020 which has been crashing repeatedly. It has a single aggregate (aggr0) with
Running WAFL_check fixed a few errors, and subsequent runs came back clean; wafliron from the maintenance menu also came back clean. However, we kept seeing errors like the ones below, even after replacing the disks that were supposedly having problems (leading me to believe something on the volume is corrupt rather than it being a physical problem). We did have a disk failure, and that disk has already been rebuilt onto a spare.
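[For reference, this is roughly how wafliron gets kicked off from the normal console -- it can also be run from maintenance mode, which is what we did; the aggregate name is ours:]

    files> priv set advanced
    files*> aggr wafliron start aggr0
    files*> aggr wafliron status aggr0    # poll until it finishes and see what, if anything, it fixed
    files*> priv set admin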
The problem looks like this:
Mon Jul 25 02:19:44 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume aggr0, Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block 3765221, inode number -45, snapid 0, file block 690030, level 1.
Mon Jul 25 02:40:10 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block #3765221
Mon Jul 25 02:40:10 PDT [raid.rg.readerr.repair.data:notice]: Fixing bad data on Disk /aggr0/plex0/rg1/0a.41 Shelf 2 Bay 9 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FV2S00007418EU6V], block #3765221
[And later -- note a few different serial numbers after we replaced the two drives the errors were showing up on; we did this by doing a 'disk fail -i' on each drive and then letting it reconstruct, one drive at a time.]
Mon Jul 25 21:44:45 PDT [raid.data.ws.blkErr:error]: WAFL sanity check detected bad data on volume aggr0, Disk /aggr0/plex0/rg1/0a.40 Shelf 2 Bay 8 [NETAPP X274_SCHT6146F10 NA08] S/N [3HY8FRM000007451DZZ2], block 3765221, inode number -45, snapid 0, file block 690030, level 1.
Tue Jul 26 04:15:29 PDT [raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg1/0c.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221
Tue Jul 26 04:15:29 PDT [raid.rg.readerr.repair.data:notice]: Fixing bad data on Disk /aggr0/plex0/rg1/0c.41 Shelf 2 Bay 9 [NETAPP X274_S10K7146F10 NA07] S/N [3KS0R58B000075418D6B], block #3765221
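[For reference, the replacement procedure was roughly the following, one drive at a time; the disk name here is just one of ours:]

    files> disk fail -i 0a.41    # fail the suspect disk immediately, without pre-copying it to a spare
    files> aggr status -r        # watch reconstruction onto a spare before touching the next disk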
also a bunch of these:
Tue Jul 26 04:16:30 PDT [wafl.dir.inconsistent:error]: wafl_nfs_lookup: inconsistent directory entry {x20 13 14495183 105181080 -520015383} <C7DA730C028D> in {x20 13 2781660 870430057 -520015383}. WAFL may be inconsistent. Call NetApp Support.
Tue Jul 26 04:16:30 PDT [wafl.dir.inconsistent:error]: wafl_nfs_lookup: inconsistent directory entry {x20 13 14495184 105181084 -520015383} <C7DDF30B0808> in {x20 13 2781660 870430057 -520015383}. WAFL may be inconsistent. Call NetApp Support.
The parity inconsistency files in /etc/crash/ show something like:
Checksum mismatch on disk /aggr0/plex0/rg1/0a.41 (S/N 3HY8FV2S00007418EU6V), block #3765221 (sector #0x1cb9f28).
Checksum mismatch found on Mon Jul 25 02:30:00 PDT 2011
Expected block: [...]
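[Since these are raw checksum mismatches, one thing that can be run online is a parity scrub of the aggregate; it re-verifies parity and checksums, though it won't fix WAFL-level metadata. Roughly:]

    files> aggr scrub start aggr0
    files> aggr scrub status     # progress, plus whether anything was found and repaired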
The system would often crash, and then crash again on reboot -- booting without /etc/rc and *then* enabling services seemed to help, though the system would usually crash again within hours.
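[In case anyone is wondering what 'enabling services' meant in practice: with /etc/rc out of the way, it was roughly just bringing up an interface and the protocols by hand. The interface name and addresses below are placeholders, not our real ones:]

    files> ifconfig e0a 192.0.2.10 netmask 255.255.255.0 up    # placeholder interface name and IP
    files> exportfs -a                                         # re-export everything in /etc/exports
    files> nfs on                                              # plus whatever other services you actually run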
After looking at a (very) old post on this list, I tried disabling quotas on the active volumes; this may have helped, but I'm nervous that the system will crash again. Short of restoring to a snapshot from before the problem started, is there anything else (online or offline) I can do to fix whatever problems might be on the volume, or to manually remove the problematic block(s)?
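[The quota change was just the obvious per-volume command; the snapshot restore I'd like to avoid would be SnapRestore of the whole affected volume. The snapshot name below is a placeholder:]

    files> quota off acs                          # repeated for each active volume with quotas
    files> snap list acs                          # find a snapshot from before the corruption started
    files> snap restore -t vol -s nightly.0 acs   # needs the snaprestore license; reverts the entire volume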
Some other info, in case it helps:

files*> aggr status -v
           Aggr State      Status          Options
          aggr0 online     raid_dp, aggr   root, diskroot, nosnap=off,
                                           raidtype=raid_dp, raidsize=16,
                                           snapmirrored=off, resyncsnaptime=60,
                                           fs_size_fixed=off,
                                           snapshot_autodelete=on,
                                           lost_write_protect=on
Volumes: m_hosts, luns, acs, root, tmp, software, home, legacy_mail, x_tmt, backups, dropbox, netflow2, xen_backup
Plex /aggr0/plex0: online, normal, active
RAID group /aggr0/plex0/rg0: normal
RAID group /aggr0/plex0/rg1: normal
files*> aggr status -r
Aggregate aggr0 (online, raid_dp) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (normal)
RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
--------- ------  ------------- ---- ---- ---- ----- --------------    --------------
dparity   0a.22   0a    1   6   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
parity    0a.32   0a    2   0   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.23   0a    1   7   FC:A   -  FCAL 10000 136000/278528000  139072/284820800
data      0a.39   0a    2   7   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.35   0c    2   3   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.26   0c    1   10  FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.45   0a    2   13  FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.18   0c    1   2   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.28   0c    1   12  FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.37   0c    2   5   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.24   0a    1   8   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.43   0a    2   11  FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.29   0c    1   13  FC:B   -  FCAL 10000 136000/278528000  137485/281570072
RAID group /aggr0/plex0/rg1 (normal)
RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
--------- ------  ------------- ---- ---- ---- ----- --------------    --------------
dparity   0a.17   0a    1   1   FC:A   -  FCAL 10000 136000/278528000  137422/281442144
parity    0a.33   0a    2   1   FC:A   -  FCAL 10000 136000/278528000  139072/284820800
data      0c.41   0c    2   9   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.38   0a    2   6   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.36   0a    2   4   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.27   0c    1   11  FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.21   0a    1   5   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.34   0a    2   2   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.19   0c    1   3   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.20   0a    1   4   FC:A   -  FCAL 10000 136000/278528000  137104/280790184
data      0c.44   0c    2   12  FC:B   -  FCAL 10000 136000/278528000  137422/281442144
data      0c.16   0c    1   0   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
data      0a.42   0a    2   10  FC:A   -  FCAL 10000 136000/278528000  137104/280790184
Spare disks
RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
--------- ------  ------------- ---- ---- ---- ----- --------------    --------------
Spare disks for block or zoned checksum traditional volumes or aggregates
spare     0c.25   0c    1   9   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
spare     0c.40   0c    2   8   FC:B   -  FCAL 10000 136000/278528000  137104/280790184
----- End forwarded message -----