Hello,
We've got a couple of FAS3050c filers running 4 DS14mk2 shelves full of disks (each filer is connected to two disk shelves). They've mostly been trouble-free, but this week it seems a disk failed, and our main storage aggregate went into "degraded" mode. For some reason, despite spare disks being available, it's not reconstructing as I would think it should.
The software running on these filers is Data ONTAP GX 10.0.1P2 -- from previous discussions with the community, I've learned that GX has a whole different set of commands, so many of the Google-able resources I've found aren't relevant. Adding to that difficulty, we don't have a support contract on these filers (but they are properly licensed and whathaveyou).
Here is the output of 'storage aggregate show -aggregate engdata1' (that is the degraded aggregate):
toast1a::> storage aggregate show -aggregate engdata1
Aggregate: engdata1
Size (MB): 0
Used Size (MB): 0
Used Percentage: -
Available Size (MB): 0
State: restricted
Nodes: toast1a
Number Of Disks: 37
Disks: toast1a:0a.16, toast1a:0b.32, toast1a:0c.48, toast1a:0a.17, toast1a:0b.33, toast1a:0c.49, toast1a:0a.18, toast1a:0b.34, toast1a:0c.50, toast1a:0a.19, toast1a:0b.35, toast1a:0c.51, toast1a:0a.20, toast1a:0d.64, toast1a:0b.37, toast1a:0a.21, toast1a:0c.52, toast1a:0b.38, toast1a:0a.22, toast1a:0c.61, toast1a:0b.39, toast1a:0d.69, toast1a:0c.54, toast1a:0b.40, toast1a:0a.24, toast1a:0c.55, toast1a:0d.65, toast1a:0a.25, toast1a:0a.26, toast1a:0b.42, toast1a:0c.59, toast1a:0a.27, toast1a:0b.43, toast1a:0a.28, toast1a:0b.45, toast1a:0d.68, toast1a:0d.71
Number Of Volumes: 0
Plexes: /engdata1/plex0(online)
RAID Groups: /engdata1/plex0/rg0, /engdata1/plex0/rg1, /engdata1/plex0/rg2
Raid Type: raid_dp
Max RAID Size: 14
RAID Status: raid_dp,degraded
Checksum Enabled: true
Checksum Status: active
Checksum Style: block
Inconsistent: true
Volume Types: flex
There are spare disks available now, but there were not when the failure occurred. I moved two spare disks to the right filer after the failure, thinking that would cause the aggregate to start reconstructing. Here is the output of 'storage disk show -state spare':
toast1a::> storage disk show -state spare
Disk             UsedSize(MB) Shelf Bay State     RAID Type  Aggregate Owner
---------------- ------------ ----- --- --------- ---------- --------- --------
toast1a:0d.72          423090     4   8 spare     pending    -         toast1a
toast1a:0d.73          423090     4   9 spare     pending    -         toast1a
toast1b:0d.74          423090     4  10 spare     pending    -         toast1b
toast1b:0d.75          423090     4  11 spare     pending    -         toast1b
toast1b:0d.76          423090     4  12 spare     pending    -         toast1b
toast1b:0d.77          423090     4  13 spare     pending    -         toast1b
6 entries were displayed.
Can anyone provide insight on this problem? Why is the aggregate not reconstructing when there are spares available? NetApp stuff is not my specialty, but I'm the one who gets to deal with it, and I am pretty stumped. Thank you in advance!
-- Chris Daniel
Try looking at it from the node perspective (I forget the syntax offhand, as it is slightly different from ONTAP 8.1+ Cluster Mode).
After you get into the node shell, do a 'disk show -n' (make sure the disks are properly assigned).
Try an 'aggr status -r'.
Maybe the spares are zeroing before they are used... that may be what "pending" means.
Have you checked the event log? What about the messages file?
--tmac
Tim McCarthy
Principal Consultant
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
On Thu, May 09, 2013 at 01:32:57PM -0700, Chris Daniel wrote:
> The software running on these filers is Data ONTAP GX 10.0.1P2 -- from previous discussions with the community, I've learned that GX has a whole different set of commands, so many of the Google-able resources I've found aren't relevant. Adding to that difficulty, we don't have a support contract on these filers (but they are properly licensed and whathaveyou).
ONTAP GX was replaced with ONTAP 8 C-mode (Cluster Mode) several years back. That should help your Googling.
> Can anyone provide insight on this problem? Why is the aggregate not reconstructing when there are spares available? NetApp stuff is not my specialty, but I'm the one who gets to deal with it, and I am pretty stumped. Thank you in advance!
Are the broken disks the same size as the spare disks? That is the main reason for disks not reconstructing.
Any messages in the logs (event log show)?
Have you tried doing a 'storage disk replace'?
John
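For the size and ownership questions raised in the replies, a quick tally of the pasted 'storage disk show -state spare' output can help. The following is only a minimal Python sketch: it assumes the whitespace-separated, eight-column row layout shown in this thread, and the function name spares_by_owner is made up for illustration. It does not talk to the filer, and the failed disk's size is not in the thread, so that figure still has to come from the filer itself.

```python
from collections import defaultdict

# Rows pasted from the 'storage disk show -state spare' output in this thread.
SPARE_OUTPUT = """\
Disk UsedSize(MB) Shelf Bay State RAID Type Aggregate Owner
toast1a:0d.72 423090 4 8 spare pending - toast1a
toast1a:0d.73 423090 4 9 spare pending - toast1a
toast1b:0d.74 423090 4 10 spare pending - toast1b
toast1b:0d.75 423090 4 11 spare pending - toast1b
toast1b:0d.76 423090 4 12 spare pending - toast1b
toast1b:0d.77 423090 4 13 spare pending - toast1b
6 entries were displayed.
"""


def spares_by_owner(text):
    """Group spare disks by owning node, keeping each disk's size in MB.

    Assumes each data row has exactly 8 whitespace-separated columns:
    disk, size, shelf, bay, state, RAID type, aggregate, owner.
    Header, separator, and summary lines are skipped.
    """
    owners = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8 or fields[4] != "spare":
            continue
        disk, size_mb, owner = fields[0], int(fields[1]), fields[7]
        owners[owner].append((disk, size_mb))
    return dict(owners)


if __name__ == "__main__":
    for owner, disks in sorted(spares_by_owner(SPARE_OUTPUT).items()):
        sizes = sorted({size for _, size in disks})
        print(owner, len(disks), "spares, sizes (MB):", sizes)
```

On the output posted above, this groups two spares under toast1a (the node that owns engdata1) and four under toast1b, all reporting the same size.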