On Thu, May 9, 2013 at 6:48 PM, Doug Siggins <DSiggins@ma.maileig.com> wrote:

1. I hope you have a backup of this data! Please make sure you at least have a mirror (on a different aggregate)!

Before anything else, get those failed disks out of there and hold on to them for a while. And yes, you will need to run a WAFL check, although it should still rebuild while inconsistent.
Delete some cores off of the head:
ngsh*> coredump delete ?
This might take a few tries, as it looks like it's trying to save the core files :)

Or you could always rm them from /mroot/kcores.
I find that much quicker.

Clear the shelf fault:
May 5 07:00:00 [statd]: monitor.shelf.fault: Fault reported on disk storage shelf attached to channel 0b. Please check fans, power, and temperature.

Broken disks

RAID Disk  Device  HA  SHELF  BAY  CHAN  Pool  Type  RPM   Used (MB/blks)    Phys (MB/blks)
---------  ------  --  -----  ---  ----  ----  ----  ----  ----------------  ----------------
failed     0a.23   0a  1      7    FC:A  -     ATA   7200  423111/866531584  423889/868126304
failed     0b.41   0b  2      9    FC:A  -     ATA   7200  423111/866531584  423889/868126304
failed     0b.44   0b  2      12   FC:A  -     ATA   7200  423111/866531584  423889/868126304
failed     0c.53   0c  3      5    FC:B  -     ATA   7200  423111/866531584  423889/868126304
failed     0c.57   0c  3      9    FC:B  -     ATA   7200  423111/866531584  423889/868126304
failed     0c.60   0c  3      12   FC:B  -     ATA   7200  423111/866531584  423889/868126304
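
As a quick sanity check on that listing, here is a minimal Python sketch that parses the six failed rows, counts them per adapter, and confirms that each MB figure equals the block count divided by 2048. That ratio is what you would get if "blks" are 512-byte units and MB means 1024*1024 bytes; both are assumptions on my part, since the output does not say. The function and variable names are made up for illustration.

```python
import re
from collections import Counter

# The six "failed" rows, copied from the listing above.
BROKEN_DISKS = """\
failed    0a.23 0a 1 7 FC:A - ATA 7200 423111/866531584 423889/868126304
failed    0b.41 0b 2 9 FC:A - ATA 7200 423111/866531584 423889/868126304
failed    0b.44 0b 2 12 FC:A - ATA 7200 423111/866531584 423889/868126304
failed    0c.53 0c 3 5 FC:B - ATA 7200 423111/866531584 423889/868126304
failed    0c.57 0c 3 9 FC:B - ATA 7200 423111/866531584 423889/868126304
failed    0c.60 0c 3 12 FC:B - ATA 7200 423111/866531584 423889/868126304
"""

# One named group per column of the listing.
ROW = re.compile(
    r"^(?P<state>\S+)\s+(?P<device>\S+)\s+(?P<ha>\S+)\s+(?P<shelf>\d+)\s+(?P<bay>\d+)"
    r"\s+(?P<chan>\S+)\s+(?P<pool>\S+)\s+(?P<type>\S+)\s+(?P<rpm>\d+)"
    r"\s+(?P<used_mb>\d+)/(?P<used_blks>\d+)\s+(?P<phys_mb>\d+)/(?P<phys_blks>\d+)$"
)


def parse_broken_disks(text):
    """Return one dict per line that matches the column layout above."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def blocks_to_mb(blocks, block_bytes=512):
    """Whole MB (1024*1024 bytes) for a block count; 512-byte blocks are assumed."""
    return blocks * block_bytes // (1024 * 1024)


rows = parse_broken_disks(BROKEN_DISKS)
failed_per_adapter = Counter(row["ha"] for row in rows)
sizes_consistent = all(
    blocks_to_mb(int(row["used_blks"])) == int(row["used_mb"])
    and blocks_to_mb(int(row["phys_blks"])) == int(row["phys_mb"])
    for row in rows
)

print(dict(failed_per_adapter))
print("MB columns match blks/2048:", sizes_consistent)
```

If the check holds, the Used and Phys columns are just two views of the same disk size, which makes it easier to compare them against sizes reported elsewhere.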

You can light up the failed disks like this:
storage disk setled -disk bosback01:0a.98 -action on

Use -action off to turn them off :)

The message below does explain why it's not rebuilding (the 520 bps thing), though that is odd: they look like the same darned disks. Whoever administered this machine overrode the inconsistent flag (which is a horrible idea). So essentially you have a volume online while the filesystem is inconsistent, and you can run into issues where it panics. Turn that flag off.

bosback01::*> aggr modify -aggregate bosback01_aggr1 -ignore-inconsistent off

I think you can then offline the aggregate and run wafliron (dbladecli aggr wafliron) while the system is online. Of course, it's always safer to boot into maint mode (Ctrl-C while booting) and then run the WAFL check. It's been so long since I've done that, and I cannot log in to maint mode to try it :)

https://kb.netapp.com/support/index?page=content&id=3013616

I'd need to research the 520 bps thing further to figure this out, but try looking at the raid options (raid.disktype, raid.whatever) under the dblade.

Of course if you run into deeper issues and you need some help I can be had for a somewhat cheap consulting fee :)

You have lots of issues here, and your data integrity is a little precarious. Again, I hope you have a backup of the data. Have fun.

From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of Chris Daniel [cjdaniel@gmail.com]
Sent: Thursday, May 09, 2013 8:44 PM
To: toasters@teaparty.net

Subject: Re: RAID not reconstructing on FAS3050c

Thanks all! So far I have been able to get into the node shell and determine that the aggregate is marked as "wafl inconsistent".

The disks are all the same size and model (SATA connected via FC, NetApp X267_MGRIZ500SSX), except for one, which is model X267_HKURO500SSX -- it appears to be of the same size as the rest, however, and is currently a good data disk.

Around the time of the failure, this message appears in the messages.ems log:

[config_thread]: raid.rg.recons.cantStart: The reconstruction cannot start in RAID group /engdata1/plex0/rg2: No 520 bps disk of required size available in spare pool

That might explain why it's not reconstructing. The full messages.ems log is here: http://pastebin.com/b9Xgj628 Also here is the output of 'aggr status -r': http://pastebin.com/MmHFNbiz
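
When scanning a long log for lines like the one quoted above, a small regex is enough to pull out the RAID group and the stated reason. A minimal Python sketch, assuming the message text looks exactly as quoted; the pattern, function name, and field names are mine and not anything ONTAP defines.

```python
import re

# The EMS line quoted above, joined into one string.
EMS_LINE = (
    "[config_thread]: raid.rg.recons.cantStart: The reconstruction cannot start "
    "in RAID group /engdata1/plex0/rg2: No 520 bps disk of required size "
    "available in spare pool"
)

# [source]: event.name: free text ... RAID group /aggr/plex/rg: reason
PATTERN = re.compile(
    r"\[(?P<source>[^\]]+)\]:\s+(?P<event>[\w.]+):\s+.*?RAID group\s+"
    r"(?P<raid_group>/\S+?):\s+(?P<reason>.+)$"
)


def parse_cant_start(line):
    """Return the fields of a raid.rg.recons.cantStart line, or None if it is not one."""
    match = PATTERN.search(line)
    if match is None or match.group("event") != "raid.rg.recons.cantStart":
        return None
    parts = match.group("raid_group").strip("/").split("/")
    if len(parts) != 3:
        return None
    aggregate, plex, raid_group = parts
    return {
        "source": match.group("source"),
        "aggregate": aggregate,
        "plex": plex,
        "raid_group": raid_group,
        "reason": match.group("reason"),
    }


parsed = parse_cant_start(EMS_LINE)
print(parsed)
```

Feeding each line of the ems_logviewer output through parse_cant_start and keeping the non-None results would list every RAID group that refused to start reconstruction, along with the reason it gave.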

I'm guessing that what I need to do is run WAFL_check? Anything else I should look at before I run off and do this?

On Thu, May 9, 2013 at 2:03 PM, Doug Siggins <DSiggins@ma.maileig.com> wrote:

This is GX; he doesn't have the option to view the messages file without logging in to the CLI:

Log in as root at the mgmt1 IP and cd /mroot/log.
There you can cat messages.log; to view messages.ems, use the ems_logviewer command:

Last login: Wed May 8 20:01:17 2013 from 10.1.91.10
bosback01# cd /mroot/log
bosback01# ems_logviewer messages.ems

That will give some more information. I have never had this issue, although I have run GX from 10.0 to 10.0.4. In fact, if you are still on GX, I have no idea why you would stick with 10.0.1; it's horribly unstable!

You can also get more information via storage disk show toast1a:0d.(number)

One thing, those look like 450/500G disks.

Are you using the same type of disk in the aggregate? (SATA/FC?)

You may need to log in to the dbladecli and allow the aggregate to use a different-sized disk to rebuild onto (options disk or raid or something).

The other thing I can think of is that you put an unsupported disk into that shelf. I know when I put in the 450G 15k FC disks I had to load a new "unsupported" disk_qual package to see all of the space and use them. Given NetApp's congenial attitude towards screwing us GX users with FAS3050s, it didn't bother me much to hack that to get it working :)

I have been through a lot with GX/C-mode, and know quite a bit. I am looking for others who know things like zsmcli, and vldbtest :)

From: toasters-bounces@teaparty.net [toasters-bounces@teaparty.net] on behalf of tmac [tmacmd@gmail.com]
Sent: Thursday, May 09, 2013 4:45 PM
To: Chris Daniel
Cc: Toasters
Subject: Re: RAID not reconstructing on FAS3050c

Try looking at it from the node perspective:
(I forget the syntax offhand, as it is slightly different than ONTAP 8.1+ Cluster-Mode)

After you get into the node shell, do a disk show -n (make sure disks are properly assigned)

Try an aggr status -r

Maybe the spares are zeroing before they are being used... that may be what "pending" is.

Have you checked the event log? What about the messages file?

--tmac

Tim McCarthy

Principal Consultant




On Thu, May 9, 2013 at 4:32 PM, Chris Daniel <cjdaniel@gmail.com> wrote:

Hello,

We've got a couple of FAS3050c filers running 4 DS14mk2 shelves full of disks (each filer is connected to two disk shelves). They've mostly been trouble-free, but this week it seems a disk failed, and our main storage aggregate went into "degraded" mode. For some reason, despite spare disks being available, it's not reconstructing as I would think it should.

The software running on these filers is Data ONTAP GX 10.0.1P2 -- from previous discussions with the community, I've learned that GX has a whole different set of commands, so many of the Google-able resources I've found aren't relevant. Adding to that difficulty, we don't have a support contract on these filers (but they are properly licensed and whathaveyou).

Here is the output of 'storage aggregate show -aggregate engdata1' (that is the degraded aggregate):

toast1a::> storage aggregate show -aggregate engdata1

Aggregate: engdata1
Size (MB): 0
Used Size (MB): 0
Used Percentage: -
Available Size (MB): 0
State: restricted
Nodes: toast1a
Number Of Disks: 37
Disks: toast1a:0a.16, toast1a:0b.32, toast1a:0c.48,
toast1a:0a.17, toast1a:0b.33, toast1a:0c.49,
toast1a:0a.18, toast1a:0b.34, toast1a:0c.50,
toast1a:0a.19, toast1a:0b.35, toast1a:0c.51,
toast1a:0a.20, toast1a:0d.64, toast1a:0b.37,
toast1a:0a.21, toast1a:0c.52, toast1a:0b.38,
toast1a:0a.22, toast1a:0c.61, toast1a:0b.39,
toast1a:0d.69, toast1a:0c.54, toast1a:0b.40,
toast1a:0a.24, toast1a:0c.55, toast1a:0d.65,
toast1a:0a.25, toast1a:0a.26, toast1a:0b.42,
toast1a:0c.59, toast1a:0a.27, toast1a:0b.43,
toast1a:0a.28, toast1a:0b.45, toast1a:0d.68, toast1a:0d.71
Number Of Volumes: 0
Plexes: /engdata1/plex0(online)
RAID Groups: /engdata1/plex0/rg0, /engdata1/plex0/rg1,
/engdata1/plex0/rg2
Raid Type: raid_dp
Max RAID Size: 14
RAID Status: raid_dp,degraded
Checksum Enabled: true
Checksum Status: active
Checksum Style: block
Inconsistent: true
Volume Types: flex
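
For anyone who wants to check output like this in a script, here is a rough Python sketch that parses the "Key: value" lines (including values that wrap onto continuation lines, as Disks and RAID Groups do) and flags the fields that look unhealthy. The sample is trimmed to the fields the check needs; the parsing rules and the choice of what counts as a warning are my own assumptions, not a documented format.

```python
import re

# Trimmed copy of the 'storage aggregate show' output above.
OUTPUT = """\
Aggregate: engdata1
State: restricted
Number Of Disks: 37
Disks: toast1a:0a.16, toast1a:0b.32, toast1a:0c.48,
toast1a:0a.17, toast1a:0b.33, toast1a:0c.49,
toast1a:0a.18, toast1a:0b.34, toast1a:0c.50,
toast1a:0a.19, toast1a:0b.35, toast1a:0c.51,
toast1a:0a.20, toast1a:0d.64, toast1a:0b.37,
toast1a:0a.21, toast1a:0c.52, toast1a:0b.38,
toast1a:0a.22, toast1a:0c.61, toast1a:0b.39,
toast1a:0d.69, toast1a:0c.54, toast1a:0b.40,
toast1a:0a.24, toast1a:0c.55, toast1a:0d.65,
toast1a:0a.25, toast1a:0a.26, toast1a:0b.42,
toast1a:0c.59, toast1a:0a.27, toast1a:0b.43,
toast1a:0a.28, toast1a:0b.45, toast1a:0d.68, toast1a:0d.71
RAID Groups: /engdata1/plex0/rg0, /engdata1/plex0/rg1,
/engdata1/plex0/rg2
RAID Status: raid_dp,degraded
Inconsistent: true
"""

# A field starts with a label of letters, spaces, or parentheses, then ": ".
# Disk names such as "toast1a:0a.17" do not match because the label contains a digit.
KEY_VALUE = re.compile(r"^(?P<key>[A-Za-z][A-Za-z ()]*):\s+(?P<value>.*)$")


def parse_show(text):
    """Parse 'Key: value' output, appending wrapped lines to the previous key."""
    fields = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = KEY_VALUE.match(line)
        if match:
            last_key = match.group("key")
            fields[last_key] = match.group("value")
        elif last_key is not None:
            fields[last_key] += " " + line
    return fields


fields = parse_show(OUTPUT)
disks = [d.strip() for d in fields["Disks"].split(",") if d.strip()]
raid_groups = [g.strip() for g in fields["RAID Groups"].split(",") if g.strip()]

warnings = []
if fields["State"] != "online":
    warnings.append("state is " + fields["State"])
if "degraded" in fields["RAID Status"]:
    warnings.append("RAID status is degraded")
if fields["Inconsistent"] == "true":
    warnings.append("aggregate is marked inconsistent")

print(len(disks), "disks listed; header says", fields["Number Of Disks"])
print(warnings)
```

Comparing the parsed disk count against the "Number Of Disks" line is a cheap way to confirm that the wrapped list was read completely.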

There are spare disks available now, but there were not when the failure occurred. I moved two spare disks to the right filer after the failure, thinking that would cause the aggregate to start reconstructing. Here is the output of 'storage disk show -state spare':

toast1a::> storage disk show -state spare

Disk           UsedSize(MB)  Shelf  Bay  State  RAID Type  Aggregate  Owner
-------------  ------------  -----  ---  -----  ---------  ---------  -------
toast1a:0d.72  423090        4      8    spare  pending    -          toast1a
toast1a:0d.73  423090        4      9    spare  pending    -          toast1a
toast1b:0d.74  423090        4      10   spare  pending    -          toast1b
toast1b:0d.75  423090        4      11   spare  pending    -          toast1b
toast1b:0d.76  423090        4      12   spare  pending    -          toast1b
toast1b:0d.77  423090        4      13   spare  pending    -          toast1b

6 entries were displayed.
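
One way to read that listing is to group the spares by owner and compare their size with what the aggregate's disks report. The sketch below does both in Python. The 423111 MB reference value is the Used (MB) figure shown for the failed disks elsewhere in this thread; whether that column and UsedSize(MB) here are measured the same way is an assumption, so treat any shortfall as something to investigate and not as a diagnosis. All names in the code are hypothetical.

```python
from collections import defaultdict

# The six spare rows, copied from the listing above.
SPARES = """\
toast1a:0d.72 423090 4 8 spare pending - toast1a
toast1a:0d.73 423090 4 9 spare pending - toast1a
toast1b:0d.74 423090 4 10 spare pending - toast1b
toast1b:0d.75 423090 4 11 spare pending - toast1b
toast1b:0d.76 423090 4 12 spare pending - toast1b
toast1b:0d.77 423090 4 13 spare pending - toast1b
"""

# Used (MB) reported for the failed disks elsewhere in the thread (assumed comparable).
REFERENCE_MB = 423111


def parse_spares(text):
    """Split each row into the eight columns of 'storage disk show'."""
    spares = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 8:
            continue  # skip headers, rulers, and blank lines
        disk, used_mb, shelf, bay, state, raid_type, aggregate, owner = parts
        spares.append({
            "disk": disk,
            "used_mb": int(used_mb),
            "state": state,
            "raid_type": raid_type,
            "owner": owner,
        })
    return spares


spares = parse_spares(SPARES)

by_owner = defaultdict(list)
for spare in spares:
    by_owner[spare["owner"]].append(spare["disk"])

smaller_than_reference = [s["disk"] for s in spares if s["used_mb"] < REFERENCE_MB]
shortfall_mb = REFERENCE_MB - min(s["used_mb"] for s in spares)

for owner, disks in sorted(by_owner.items()):
    print(owner, len(disks), disks)
print("spares smaller than reference:", len(smaller_than_reference))
print("largest shortfall (MB):", shortfall_mb)
```

Only spares owned by the node that owns the aggregate are candidates for its reconstruction, which is why the per-owner grouping comes first.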

Can anyone provide insight on this problem? Why is the aggregate not reconstructing when there are spares available? NetApp stuff is not my specialty, but I'm the one who gets to deal with it, and I am pretty stumped. Thank you in advance!

--
Chris Daniel

_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters
