OK, I have seen the last ASUP via email, along with the current disk layout. Here they are for everyone.
Last known ASUP:
Aggregate aggr0 (online, raid_dp, degraded) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (double degraded, block checksums)
RAID Disk Device Model Number Serial Number VBN Start VBN End
--------- ------ ------------ ------------- --------- -------
dparity 0a.44 X279_HVPBP288F15 JLVBYWVC - -
parity 0a.48 X279_S15K5288F15 3LM42GHN00009839N66A - -
data 0a.49 X279_S15K5288F15 3LM1TA6C000098080J6N 0 69626751
data 0a.16 X279_HVPBP288F15 JLVJV9TC 69626752 139253503
data 0a.50 X279_HVIPB288F15 J8YMUD4C 139253504 208880255
data 0a.45 X279_S15K7288F15 6SJ5HVVK0000B24505C9 208880256 278507007
data 0a.51 X279_S15K5288F15 3LM1T7DZ00009808S8XF 348133760 417760511
data 0a.18 X279_HVPBP288F15 JLVJVENC 417760512 487387263
data 0a.52 X279_HVPBP288F15 JLVHYUNC 487387264 557014015
data 0a.19 X279_HVPBP288F15 JLVJWUUC 557014016 626640767
data 0a.20 X279_HVPBP288F15 JLVJWJ3C 626640768 696267519
data 0a.22 X279_HVPBP288F15 JLVK7NJC 765894272 835521023
data 0a.23 X279_HVPBP288F15 JLVJUE9C 835521024 905147775
data 0a.25 X279_HVPBP288F15 JLVJWKRC 905147776 974774527
RAID group /aggr0/plex0/rg1 (double degraded, block checksums)
RAID Disk Device Model Number Serial Number VBN Start VBN End
--------- ------ ------------ ------------- --------- -------
data 0a.33 X279_HVPBP288F15 JLVJUENC 974774528 1044401279
data 0a.26 X279_HVPBP288F15 JLVKVL7C 1044401280 1114028031
data 0a.34 X279_HVPBP288F15 JLVBYUWC 1114028032 1183654783
data 0a.27 X279_HVPBP288F15 JLVBZL3C 1183654784 1253281535
data 0a.35 X279_HVPBP288F15 JLVBYVYC 1253281536 1322908287
data 0a.28 X279_HVPBP288F15 JLVJV9RC 1322908288 1392535039
data 0a.36 X279_HVPBP288F15 JLVBYSBC 1392535040 1462161791
data 0a.29 X279_HVPBP288F15 JLVJWP5C 1462161792 1531788543
data 0a.37 X279_HVPBP288F15 JLVBYS1C 1531788544 1601415295
data 0a.38 X279_HVPBP288F15 JLVBYW7C 1601415296 1671042047
data 0a.39 X279_HVPBP288F15 JLVBYWWC 1671042048 1740668799
data 0a.40 X279_HVPBP288F15 JLVBYRRC 1740668800 1810295551
data 0a.41 X279_HVPBP288F15 JLVBYPWC 1810295552 1879922303
data 0a.60 X279_HVPBP288F15 JLVJ6YZC 1879922304 1949549055
RAID group /aggr0/plex0/rg2 (normal, block checksums)
RAID Disk Device Model Number Serial Number VBN Start VBN End
--------- ------ ------------ ------------- --------- -------
dparity 0a.54 X279_S15K5288F15 3LM42HL100009839NXS7 - -
parity 0a.55 X279_S15K5288F15 3LM45JAD00009839N5JZ - -
data 0a.56 X279_S15K5288F15 3LM47SPT00009840R2EL 1949549056 2019175807
data 0a.57 X279_HVIPB288F15 J8YT82MC 2019175808 2088802559
data 0a.58 X279_HVPBP288F15 JLVGNHKC 2088802560 2158429311
data 0a.59 X279_HVPBP288F15 JLVJGJVC 2158429312 2228056063
And here is the "aggr status -r" output from MAINT mode:
Aggregate aggr0 (failed, Aug 19 22:47:17 [localhost:disk.failmsg:error]: Disk 0a.21 (JLVKV9SC): non-persistent message received. 0 [NETAPP X279_HVPBP288F15 NA02] S/N [JLVKV9SC]
raid_dp, partial) (block checksums)
Aug 19 22:47:17 [localhost:disk.failmsg:error]: Disk 0a.32 (J8VYJBHC): non-persistent message received. 0 [NETAPP X279_HVIPB288F15 NA01] S/N [J8VYJBHC]
Plex /aggr0/plex0 (offline, failed, inactive)
Aug 19 22:47:17 [localhost:raid.fdr.failed.ok:info]: Disk 0a.21 Shelf 1 Bay 5 [NETAPP X279_HVPBP288F15 NA02] S/N [JLVKV9SC] successfully deleted from spare pool
RAID group /aggr0/plex0/rg0 (partial, block checksums)
Aug 19 22:47:17 [localhost:raid.fdr.failed.ok:info]: Disk 0a.32 Shelf 2 Bay 0 [NETAPP X279_HVIPB288F15 NA01] S/N [J8VYJBHC] successfully deleted from spare pool
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity FAILED N/A 272000/ -
parity 0a.48 0a 3 0 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 0a.49 0a 3 1 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 0a.16 0a 1 0 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.50 0a 3 2 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.45 0a 2 13 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data FAILED N/A 272000/ -
data 0a.51 0a 3 3 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 0a.18 0a 1 2 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.52 0a 3 4 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.19 0a 1 3 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.20 0a 1 4 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data FAILED N/A 272000/ -
data 0a.22 0a 1 6 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.23 0a 1 7 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.25 0a 1 9 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
Raid group is missing 3 disks.
RAID group /aggr0/plex0/rg1 (double degraded, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity FAILED N/A 272000/ -
parity FAILED N/A 272000/ -
data 0a.33 0a 2 1 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.26 0a 1 10 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.34 0a 2 2 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.27 0a 1 11 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.35 0a 2 3 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.28 0a 1 12 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.36 0a 2 4 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.29 0a 1 13 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.37 0a 2 5 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.38 0a 2 6 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.39 0a 2 7 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.40 0a 2 8 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.41 0a 2 9 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.60 0a 3 12 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
RAID group /aggr0/plex0/rg2 (degraded, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0a.54 0a 3 6 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
parity FAILED N/A 272000/ -
data 0a.56 0a 3 8 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 0a.57 0a 3 9 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.58 0a 3 10 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.59 0a 3 11 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
Unassimilated aggr0 disks
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
orphaned 0a.72 0a 4 8 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
What seems ODD to me is the FAILED disks and the way they are displayed. Look at the FAILED entries above. From the ASUP message we know that 0a.33 and 0a.26 are dparity and parity, but in the maintenance-mode output they are shifted and showing as data. In the last RAID group, 0a.55 is failed and shows as FAILED. Weird.
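One thing that does check out is the VBN math. Each data disk covers 69,626,752 blocks (0a.49 runs from VBN 0 through 69626751), and the rg0 listing in the last ASUP skips exactly one disk-sized range in two places:

348133760 - 278507008 = 69626752 (one missing disk between 0a.45 and 0a.51)
765894272 - 696267520 = 69626752 (one missing disk between 0a.20 and 0a.22)

Those two gaps match the two FAILED data slots in the maintenance-mode rg0 listing; add dparity 0a.44 failing afterward and you get the "Raid group is missing 3 disks" message.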
My GUT is telling me some sort of disk enumeration bug is being encountered on this version of ONTAP. Unless someone here has actually been through this before, this is going to require opening a case with NetApp in some form or fashion. Some lower-level disk label corrections may need to happen, but I am not 100% sure on that.
Did the drives all fail at once, or one at a time and never get replaced? Just curious.
You *might* be able to get a one-time support call. If you are willing to pony up the $$ to re-enable support, they might help you out now for free. Or you can purchase one-time support; I think I've heard it runs something like $5k-$10k.
--tmac
That might just be a bug with the "aggr status -r" output; not sure, I'd have to research it more. However, the "sysconfig -r" output shows rg1 correctly:
RAID group /aggr0/plex0/rg1 (double degraded, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity FAILED N/A 272000/ -
parity FAILED N/A 272000/ -
data 0a.33 0a 2 1 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
data 0a.26 0a 1 10 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
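For anyone who wants to reproduce the comparison, both listings come straight from the maintenance-mode prompt (the *> prompt below is illustrative; exact formatting varies by ONTAP release):

*> aggr status -r     # per-RAID-group view, the one that looks shifted
*> sysconfig -r       # full RAID listing, which reports rg1's dparity/parity as FAILED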
Anyway, comparing the last ASUP to the output from maintenance mode, it looks like the last disk to fail was 0a.44. When that disk failed, it took the aggr offline due to 3 failed disks in a single RAID group.
LAST ASUP:
Aggregate aggr0 (online, raid_dp, degraded) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (double degraded, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0a.44 0a 2 12 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
parity 0a.48 0a 3 0 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 0a.49 0a 3 1 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 0a.16 0a 1 0 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
MAINT MODE:
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity FAILED N/A 272000/ -
parity 0a.48 0a 3 0 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 0a.49 0a 3 1 FC:A 0 FCAL 15000 272000/557056000 280104/573653840
data 0a.16 0a 1 0 FC:A 0 FCAL 15000 272000/557056000 274845/562884296
IF, and that's a big IF... you can unfail disk 0a.44, you might be able to get the aggr back online. Once the aggr is online and the controller boots up, you're gonna want some spares in there so reconstructs can start. I would expect disk 0a.44 to fail again at some point in the near future; hopefully it stays online long enough for the recons in rg0 to finish. Otherwise, you're looking at a panic and the controller going down again.
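For the record, the attempt itself would be short. This is only a sketch from maintenance mode, assuming your release exposes disk unfail there; it's an unsupported, label-level operation, so ideally run it with NetApp on the line:

*> disk unfail 0a.44     # force the failed dparity back into rg0
*> aggr status -r        # rg0 should drop from "partial" back to "double degraded"
*> halt                  # then boot normally and let the aggr come online

If that works, get spares in right away so the rg0 reconstructs can start before 0a.44 dies again.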
What's your spare situation on the partner controller? Can you assign a few to this controller? Also, do you have backups? (just in case)
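If the partner has spares and you're using software disk ownership, moving a couple over would look roughly like this (0a.61 is just a placeholder disk name, and the prompts are illustrative):

partner> aggr status -s                      # list the partner's hot spares
partner> priv set advanced
partner*> disk assign 0a.61 -s unowned -f    # release one spare from the partner
recovering> disk assign 0a.61                # claim it on the troubled controller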