Thanks all! So far I have been able to get into the node shell and determine that the aggregate is marked as "wafl inconsistent".
The disks are all the same size and model (SATA connected via FC, NetApp X267_MGRIZ500SSX), except for one, which is model X267_HKURO500SSX -- it appears to be of the same size as the rest, however, and is currently a good data disk.
Around the time of the failure, this message appears in the messages.ems log:
[config_thread]: raid.rg.recons.cantStart: The reconstruction cannot start in RAID group /engdata1/plex0/rg2: No 520 bps disk of required size available in spare pool
That might explain why it's not reconstructing. The full messages.ems log is here: http://pastebin.com/b9Xgj628
Also, here is the output of 'aggr status -r': http://pastebin.com/MmHFNbiz
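Before doing anything drastic, I plan to compare the failed position against the spare pool from the node shell -- something like the following, assuming the node shell takes the familiar 7-Mode-style commands (that part is a guess on my side):

aggr status -r     (shows the failed slot plus each disk's used size)
aggr status -s     (lists the spares with their sizes, for comparison against the failed slot)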
I'm guessing that what I need to do is run WAFL_check? Anything else I should look at before I run off and do this?
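(For what it's worth, from what I've been able to read there is also an online checker called wafliron; in 7-Mode it is apparently started from advanced privilege roughly like this, though I have no idea whether the GX dblade takes the same syntax:)

priv set advanced
aggr wafliron start engdata1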
On Thu, May 9, 2013 at 2:03 PM, Doug Siggins <DSiggins@ma.maileig.com> wrote:
This is GX; he doesn't have the option to view the messages file without logging into the CLI.
Log in as root at the mgmt1 IP, then cd /mroot/log. There you can cat messages.log, and to view messages.ems use the ems_logviewer command:
Last login: Wed May 8 20:01:17 2013 from 10.1.91.10
bosback01# cd /mroot/log
bosback01# ems_logviewer messages.ems
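If grep is available in that shell (it looks BSD-ish, so it probably is, but I haven't checked), you can also pull just the RAID lines out of the plain-text log, for example:

bosback01# grep -i raid /mroot/log/messages.log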
That will give you some more information. I have never hit this particular issue, but I have run GX from 10.0 through 10.0.4. In fact, if you are still on GX, I have no idea why you would stick with 10.0.1; it's horribly unstable!
You can also get more information via 'storage disk show toast1a:0d.(number)'.
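For example, against one of the pending spares from your listing:

toast1a::> storage disk show toast1a:0d.72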
One thing: those look like 450/500 GB disks.
Are you using the same type of disk in the aggregate? (SATA/FC?)
You may need to log in to the dbladecli and allow the aggregate to rebuild onto a different-sized disk (an 'options disk' or 'options raid' setting, or something along those lines).
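Once you're in the dbladecli, something like this should at least list the candidate settings (going from 7-Mode memory here; I don't remember the exact option name, so treat these as a starting point):

priv set advanced
options raid
options disk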
The other thing I can think of is that you put an unsupported disk into that shelf. I know when I put in the 450 GB 15k FC drives, I had to load a new "unsupported" disk_qual package to see all of their space and use them. Which, given NetApp's congenial attitude toward screwing us GX users with FAS3050s, I didn't mind hacking to get working :)
I have been through a lot with GX/C-mode and know quite a bit. I am looking for others who know things like zsmcli and vldbtest :)
*From:* toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of tmac [tmacmd@gmail.com] *Sent:* Thursday, May 09, 2013 4:45 PM *To:* Chris Daniel *Cc:* Toasters *Subject:* Re: RAID not reconstructing on FAS3050c
Try looking at it from the node perspective (I forget the syntax offhand, as it is slightly different from ONTAP 8.1+ Cluster-Mode):
After you get into the node shell, do a 'disk show -n' (make sure the disks are properly assigned).
Then try an 'aggr status -r'.
Maybe the spares are zeroing before they can be used... that may be what 'pending' means.
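If that is what's going on, the node shell should tell you -- roughly like this (again, going from memory on the 7-Mode-style commands):

aggr status -s     (spares, and whether they are zeroed, show up here)
disk zero spares   (starts zeroing any not-yet-zeroed spares)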
Have you checked the event log? What about the messages file?
--tmac
*Tim McCarthy* *Principal Consultant*
Clustered ONTAP NCDA ID: XK7R3GEKC1QQ2LVD (expires 08 November 2014)
RHCE5 805007643429572 (expires with release of RHEL7)
Clustered ONTAP NCSIE ID: C14QPHE21FR4YWD4 (expires 08 November 2014)
On Thu, May 9, 2013 at 4:32 PM, Chris Daniel cjdaniel@gmail.com wrote:
Hello,
We've got a couple of FAS3050c filers running 4 DS14mk2 shelves full of disks (each filer is connected to two disk shelves). They've mostly been trouble-free, but this week it seems a disk failed, and our main storage aggregate went into "degraded" mode. For some reason, despite spare disks being available, it's not reconstructing as I would think it should.
The software running on these filers is Data ONTAP GX 10.0.1P2 -- from previous discussions with the community, I've learned that GX has a whole different set of commands, so many of the Google-able resources I've found aren't relevant. Adding to that difficulty, we don't have a support contract on these filers (but they are properly licensed and what have you).
Here is the output of 'storage aggregate show -aggregate engdata1' (that is the degraded aggregate):
toast1a::> storage aggregate show -aggregate engdata1
          Aggregate: engdata1
          Size (MB): 0
     Used Size (MB): 0
    Used Percentage: -
Available Size (MB): 0
              State: restricted
              Nodes: toast1a
    Number Of Disks: 37
              Disks: toast1a:0a.16, toast1a:0b.32, toast1a:0c.48, toast1a:0a.17,
                     toast1a:0b.33, toast1a:0c.49, toast1a:0a.18, toast1a:0b.34,
                     toast1a:0c.50, toast1a:0a.19, toast1a:0b.35, toast1a:0c.51,
                     toast1a:0a.20, toast1a:0d.64, toast1a:0b.37, toast1a:0a.21,
                     toast1a:0c.52, toast1a:0b.38, toast1a:0a.22, toast1a:0c.61,
                     toast1a:0b.39, toast1a:0d.69, toast1a:0c.54, toast1a:0b.40,
                     toast1a:0a.24, toast1a:0c.55, toast1a:0d.65, toast1a:0a.25,
                     toast1a:0a.26, toast1a:0b.42, toast1a:0c.59, toast1a:0a.27,
                     toast1a:0b.43, toast1a:0a.28, toast1a:0b.45, toast1a:0d.68,
                     toast1a:0d.71
  Number Of Volumes: 0
             Plexes: /engdata1/plex0(online)
        RAID Groups: /engdata1/plex0/rg0, /engdata1/plex0/rg1, /engdata1/plex0/rg2
          Raid Type: raid_dp
      Max RAID Size: 14
        RAID Status: raid_dp,degraded
   Checksum Enabled: true
    Checksum Status: active
     Checksum Style: block
       Inconsistent: true
       Volume Types: flex
There are spare disks available now, but there were not when the failure occurred. I moved two spare disks to the right filer after the failure, thinking that would cause the aggregate to start reconstructing. Here is the output of 'storage disk show -state spare':
toast1a::> storage disk show -state spare
Disk             UsedSize(MB)  Shelf  Bay  State  RAID Type  Aggregate  Owner
---------------  ------------  -----  ---  -----  ---------  ---------  -------
toast1a:0d.72          423090      4    8  spare  pending    -          toast1a
toast1a:0d.73          423090      4    9  spare  pending    -          toast1a
toast1b:0d.74          423090      4   10  spare  pending    -          toast1b
toast1b:0d.75          423090      4   11  spare  pending    -          toast1b
toast1b:0d.76          423090      4   12  spare  pending    -          toast1b
toast1b:0d.77          423090      4   13  spare  pending    -          toast1b
6 entries were displayed.
Can anyone provide insight on this problem? Why is the aggregate not reconstructing when there are spares available? NetApp stuff is not my specialty, but I'm the one who gets to deal with it, and I am pretty stumped. Thank you in advance!
-- Chris Daniel
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
Why do you have so many broken disks?