Hi,
has anyone ever seen a RAID reconstruction start immediately after an ONTAP upgrade?
I just upgraded one of my older filers to 8.3.1P2, and it is now running a reconstruction on one of its aggregates for no obvious reason (at least to me).
During the boot of this controller after the upgrade, I saw the following message on the console, which did not show up on the second controller:
Creating trace file /etc/log/rastrace/RAID_0_20170402_17:18:00:095890.dmp
No disks show as broken, in maintenance mode, or anything like that, so any hints would be welcome.
Here's the output of `aggr status -r` on this controller:
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0b.22.23 0b 22 23 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
parity 0b.22.4 0b 22 4 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 2a.21.5 2a 21 5 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 0b.22.5 0b 22 5 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 2a.21.6 2a 21 6 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 0b.22.6 0b 22 6 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 2a.21.7 2a 21 7 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 0b.22.7 0b 22 7 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816 (reconstruction 3% completed)
data 2a.21.8 2a 21 8 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 0b.22.8 0b 22 8 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 2a.21.9 2a 21 9 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 0b.22.9 0b 22 9 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 2a.21.10 2a 21 10 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 0b.22.10 0b 22 10 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 2a.21.11 2a 21 11 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 0b.22.11 0b 22 11 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 2a.21.12 2a 21 12 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 0b.22.12 0b 22 12 SA:A 0 BSAS 7200 1695466/3472315904 1695759/3472914816
data 2a.21.13 2a 21 13 SA:B 0 BSAS 7200 1695466/3472315904 1695759/3472914816
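For reference, these are roughly the other checks I have run (or still plan to run) from the clustershell; the aggregate and node names are replaced with placeholders here, and exact field names may differ slightly on 8.3:
::> storage disk show -broken
::> storage aggregate show -aggregate <aggr_name> -fields state,raidstatus
::> system node run -node <node_name> -command "sysconfig -r"
::> event log show -node <node_name>
None of that tells me why the rebuild started; I assume the rastrace dump from the console message above has more detail, but that it is only readable by support.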
Thanks,
Alexander Griesser
Head of Systems Operations
ANEXIA Internetdienstleistungs GmbH
E-Mail: AGriesser@anexia-it.com
Web: http://www.anexia-it.com
Registered office Klagenfurt: Feldkirchnerstraße 140, 9020 Klagenfurt
Managing Director: Alexander Windbichler
Commercial register: FN 289918a | Place of jurisdiction: Klagenfurt | VAT number: AT U63216601
Hi All,
This is my first post, so please be gentle. I'm not sure if the screen grabs will come through, and I hope I have removed any IP addresses etc.
Basically, we upgraded our 8.3.2 cDOT systems to 9.1P1 at the weekend (1 x FAS2552 & 2 x FAS8040).
We then started getting the email below on every SnapMirror update:
-----Original Message-----
From:
Sent: 25 March 2017 14:07
To:
Subject: ukhc3-ntap-b: smc.snapmir.update.fail [LOG_ERR]
Filer: ukhc3-ntap-b
Time: Sat, Mar 25 14:07:01 2017 +0000
Severity: LOG_ERR
Message: smc.snapmir.update.fail: Snapmirror update from source volume 'ntap-cl1-svmcifs:v_ukliss_cifs_sclive_t4r2mv' to destination volume 'ukliss-ntap-cl1-svmcifs_DR:mv_ukliss_cifs_sclive_t4r2' failed with error 'Note=Scheduled update operation is queued.'. Relationship UUID '18916a2e-9dae-11e6-b319-00a098a01c55'.
Description: This message occurs when a scheduled SnapMirror(R) transfer fails.
Action: Check the snapmirror_error log in the destination filer and take action accordingly.
Source: sm_logger_main
Index: 3572349
[Note: This email message is sent using a deprecated event routing mechanism.
For information, search the knowledgebase of the NetApp support web site for "convert existing event configurations in Data ONTAP 9.0."]
--------------------------
Yet when checking the relationships, everything appears OK and the state is healthy.
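By "healthy" I mean a check along these lines against the destination path from the alert (fields trimmed for brevity), which comes back clean:
ukhc3::> snapmirror show -destination-path ukliss-ntap-cl1-svmcifs_DR:mv_ukliss_cifs_sclive_t4r2 -fields state,status,healthy,unhealthy-reason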
The snapmirror_error log shows these "delete failed" errors:
Sun Mar 26 08:05:01 BST 2017 ScheduledUpdate[Mar 26 08:05:00]:18916a2e-9dae-11e6-b319-00a098a01c55 Operation-Uuid=46f5c6a8-2c4d-4bf5-aa7c-cf85563d0897 Group=none Operation-Cookie=0 sm_rpm_snaplist_delete:6682 Workflow ID [114] Snapshot "snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347364.2017-03-26_070500" delete failed
Sun Mar 26 08:05:01 BST 2017 ScheduledUpdate[Mar 26 08:05:00]:098469ee-0e4d-11e7-b319-00a098a01c55 Operation-Uuid=7e139dd9-25b1-474e-9b5c-6c8ce39a88cf Group=none Operation-Cookie=0 sm_rpm_snaplist_delete:6682 Workflow ID [112] Snapshot "snapmirror.30446395-2d51-11e6-86a9-00a098a01c55_2158347374.2017-03-26_040500" delete failed
Sun Mar 26 08:05:03 BST 2017 ScheduledUpdate[Mar 26 08:05:00]:f2b650fe-a011-11e6-b319-00a098a01c55 Operation-Uuid=bfd58d04-5839-4dd8-9764-5bd5690b219b Group=none Operation-Cookie=0 sm_rpm_snaplist_delete:6682 Workflow ID [113] Snapshot "snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347365.2017-03-26_070500" delete failed
Sun Mar 26 09:05:01 BST 2017 ScheduledUpdate[Mar 26 09:05:00]:18916a2e-9dae-11e6-b319-00a098a01c55 Operation-Uuid=3b53334a-8550-4f32-890c-f456af8ea6d5 Group=none Operation-Cookie=0 sm_rpm_snaplist_delete:6682 Workflow ID [120] Snapshot "snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347364.2017-03-26_080500" delete failed
Sun Mar 26 09:05:03 BST 2017 ScheduledUpdate[Mar 26 09:05:00]:f2b650fe-a011-11e6-b319-00a098a01c55 Operation-Uuid=baa2cbbb-86b1-49e0-96a2-250973362b7b Group=none Operation-Cookie=0 sm_rpm_snaplist_delete:6682 Workflow ID [119] Snapshot "snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347365.2017-03-26_080500" delete failed
This is confirmed by doing a snap list:
ukhc3::> snap list mv_ukliss_cifs_scarchive_t4r2
snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347365.2017-03-26_080500 3.16MB 0% 0%
snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347365.2017-03-26_090500 2.93MB 0% 0%
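The same leftover snapshots are also visible via the full snapshot command, e.g. something like:
ukhc3::> volume snapshot show -volume mv_ukliss_cifs_scarchive_t4r2 -snapshot snapmirror* -fields create-time,owners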
But the audit log confirms they are being deleted an hour later (time and time zone confirmed correct):
Sun Mar 26 09:05:00 BST 2017 ScheduledUpdate[Mar 26 09:05:00]:18916a2e-9dae-11e6-b319-00a098a01c55 Operation-Uuid=3b53334a-8550-4f32-890c-f456af8ea6d5 Group=none Operation-Cookie=0 action=Info sm_rpm_snaplist_delete([120]): Snapshot snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347364.2017-03-26_070500 deleted.
Sun Mar 26 09:05:00 BST 2017 ScheduledUpdate[Mar 26 09:05:00]:18916a2e-9dae-11e6-b319-00a098a01c55 Operation-Uuid=3b53334a-8550-4f32-890c-f456af8ea6d5 Group=none Operation-Cookie=0 action=Info snapshot(s) up to snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347364.2017-03-26_090500 will be selected for transfer.
Sun Mar 26 09:05:00 BST 2017 ScheduledUpdate[Mar 26 09:05:00]:18916a2e-9dae-11e6-b319-00a098a01c55 Operation-Uuid=3b53334a-8550-4f32-890c-f456af8ea6d5 Group=none Operation-Cookie=0 action=Transfer_snapshots (1 of 1 log entries): snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347364.2017-03-26_090500
Sun Mar 26 09:05:00 BST 2017 ScheduledUpdate[Mar 26 09:05:00]:f2b650fe-a011-11e6-b319-00a098a01c55 Operation-Uuid=baa2cbbb-86b1-49e0-96a2-250973362b7b Group=none Operation-Cookie=0 action=Info sm_rpm_snaplist_delete([119]): Snapshot snapmirror.c677ecdf-3df5-11e6-86a9-00a098a01c55_2158347365.2017-03-26_070500 deleted.
The relationship is also healthy in OCUM:
[screenshot: OnCommand Unified Manager showing the relationship as healthy]
However, this line is present in the snapmirror audit log:
Sun Mar 26 07:05:00 BST 2017 CreateSnapshotDest[Mar 26 07:05:00]:18916a2e-9dae-11e6-b319-00a098a01c55 Operation-Uuid=93851169-93b6-4067-abfb-9948c09920c8 Group=none Operation-Cookie=0 action=End source=ntap-cl1-svmcifs:v_ukliss_cifs_sclive_t4r2mv destination=ukliss-ntap-cl1-svmcifs_DR:mv_ukliss_cifs_sclive_t4r2 status=Success Note=Scheduled update operation is queued.
Success but queued?
Just to see if it was an issue with the upgrade, I have created multiple test relationships:
- Local aggr1 to aggr2
- In reverse
- Between the 2552 & the 8040s
- Between the 8040s
but I am getting the same issue (rough commands for one of the test pairs are sketched below the screenshot):
[screenshot: the same smc.snapmir.update.fail alert on the test relationships]
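For what it's worth, each test pair was set up with the usual sequence, roughly like this (the SVM, volume and aggregate names below are made-up placeholders; the policy and schedule are just what I picked for testing):
ukhc3::> volume create -vserver svm_test -volume test_dst -aggregate aggr2 -size 1g -type DP
ukhc3::> snapmirror create -source-path svm_test:test_src -destination-path svm_test:test_dst -type XDP -policy MirrorAllSnapshots -schedule hourly
ukhc3::> snapmirror initialize -destination-path svm_test:test_dst
ukhc3::> snapmirror update -destination-path svm_test:test_dst
Every scheduled update on these test relationships produces the same alert, even though snapmirror show reports them as healthy.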
Indeed, at this point I have come to the conclusion that we have hit bug 1042534:
http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1042534
Has anyone else experienced this, or can anyone offer some useful advice on our next steps?
Currently this system is only partially in production as a mv target, so the email spam is not too bad, but we plan to add many relationships very soon.
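In the meantime, to keep the mailbox usable once more relationships land, I am considering converting the old event route email config to the ONTAP 9 notification mechanism (as the footer of the alert suggests) and excluding this one message. Something roughly like the following; the filter/destination names and the address are made up, and the severity list would need tuning:
ukhc3::> event notification destination create -name storage-mail -email storage-team@example.com
ukhc3::> event filter create -filter-name important-no-smc
ukhc3::> event filter rule add -filter-name important-no-smc -type exclude -message-name smc.snapmir.update.fail
ukhc3::> event filter rule add -filter-name important-no-smc -type include -message-name * -severity EMERGENCY,ALERT,ERROR
ukhc3::> event notification create -filter-name important-no-smc -destinations storage-mail
That only hides the symptom though, so I would still like to understand the queued-but-failed behaviour itself.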
Kind Regards,
Chris.