Hi Toasters,
I had a pretty stunning experience on a customer's used filer installation yesterday. Long story short: a FAS2040 with one DS4243 shelf (24 x 1 TB SATA disks) had to be reinitialized completely because the controller and shelves had been procured from different sources, so there was no root volume, etc. After I had figured out the disk reassignment and wiped both filers, they both booted up with a 3-disk raid_dp aggr0 and I could start working on them.
I then added a bunch of disks to filer #1 and continued with the configuration until I found out that some of the disks would not allow their firmware to be updated. The disks in question look like this:
*> disk_list
DISK CHAN VENDOR PRODUCT ID REV SERIAL# HW (BLOCKS BPS) DQ
------------ ----- -------- ---------------- ---- -------------------- -- -------------- --
0d.11.2 SA:A NETAPP X302_WVULC01TSSM 4321 WD-WCAW30821984 ff 1953525168 512 N
Working disks look like this:
*> disk_list
DISK CHAN VENDOR PRODUCT ID REV SERIAL# HW (BLOCKS BPS) DQ
------------ ----- -------- ---------------- ---- -------------------- -- -------------- --
0d.11.0 SA:A NETAPP X302_WVULC01TSSM NA02 WD-WCAW31217461 ff 1953525168 512 N
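With 24 disks in the shelf, picking out the affected ones by eye gets tedious. The following is a minimal sketch, not a NetApp tool, that parses saved disk_list output and flags every disk whose REV column reads "4321". The function names and the SUSPECT_REVS set are made up for illustration, and the parser assumes that each data row has exactly the ten whitespace-separated fields shown in the two rows above.

```python
import re

# The two data rows quoted above, as captured from the console.
SAMPLE = """\
0d.11.2 SA:A NETAPP X302_WVULC01TSSM 4321 WD-WCAW30821984 ff 1953525168 512 N
0d.11.0 SA:A NETAPP X302_WVULC01TSSM NA02 WD-WCAW31217461 ff 1953525168 512 N
"""

# Assumption: only this revision string is treated as suspect.
SUSPECT_REVS = {"4321"}


def parse_disk_list(text):
    """Return one dict per data row; header and separator lines are skipped."""
    disks = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 10 fields and start with a disk name such as 0d.11.2.
        if len(fields) != 10 or not re.fullmatch(r"\w+\.\d+\.\d+", fields[0]):
            continue
        disks.append({
            "disk": fields[0],
            "chan": fields[1],
            "vendor": fields[2],
            "product": fields[3],
            "rev": fields[4],
            "serial": fields[5],
            "blocks": int(fields[7]),
            "bps": int(fields[8]),
        })
    return disks


def suspect_disks(disks):
    """Names of all disks whose firmware revision is in SUSPECT_REVS."""
    return [d["disk"] for d in disks if d["rev"] in SUSPECT_REVS]


def capacity_bytes(disk):
    """Raw capacity: number of blocks times bytes per sector."""
    return disk["blocks"] * disk["bps"]


if __name__ == "__main__":
    parsed = parse_disk_list(SAMPLE)
    print("suspect disks:", suspect_disks(parsed))
    for d in parsed:
        print(d["disk"], d["rev"], capacity_bytes(d), "bytes")
```

Feeding it the complete disk_list capture instead of SAMPLE would list every disk that still needs attention. The capacity helper only confirms that both disk types report the same raw size (1953525168 blocks of 512 bytes).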
I googled a bit and found out that disks showing up with REV "4321" need to be replaced; there seems to have been a series of disks in the past with this error. So I pulled one of those disks out of the filer and replaced it with another one. The system immediately started to reconstruct the now-missing filesystem disk onto a spare disk, and the log started to fill up with block errors on other disks during reconstruction. About 10 minutes later a double reconstruct was running, and about 30 minutes later the filer panicked due to a multi-disk failure, and that's where I ended up.
So since there was no data on the filer, I wiped it again and am now back up and running with 3 disks in aggr0 and I'm currently replacing one of the "4321" disks with another one:
*> disk replace start 0d.11.2 0d.11.3
*> aggr status -r
Aggregate aggr0 (online, raid_dp) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (normal, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 0d.11.0 0d 11 0 SA:A - BSAS 7200 847555/1735794176 847884/1736466816
parity 0d.11.1 0d 11 1 SA:A - BSAS 7200 847555/1735794176 847884/1736466816
data 0d.11.2 0d 11 2 SA:A - BSAS 7200 847555/1735794176 847884/1736466816 (replacing, copy in progress)
-> copy 0d.11.3 0d 11 3 SA:A - BSAS 7200 847555/1735794176 847884/1736466816 (copy 0% completed)
The question now is: since the spare disks were all properly zeroed and there were no log entries showing block errors on these disks, how can I verify that all my spare disks are really good? I'd love to run an intensive test on all the disks to make sure something like that doesn't happen again once I put the filer into production.
I'm very thankful for any advice in this regard. Is disk maint used for things like that?
*> disk maint list
Disk maint tests available
Test index: 0 Test Id: ws Test name: Write Same Test
Test index: 1 Test Id: ndst Test name: NDST Test
Test index: 2 Test Id: endst Test name: Extended NDST Test
Test index: 3 Test Id: vt Test name: Verify Test
Test index: 4 Test Id: ss Test name: Start Stop Test
Test index: 5 Test Id: dt Test name: Data Integrity Test
Test index: 6 Test Id: rdt Test name: Read Test
Test index: 7 Test Id: pc Test name: Power Cycle Test
Thanks,
Alexander Griesser
Head of Systems Operations
ANEXIA Internetdienstleistungs GmbH
E-Mail: AGriesser@anexia-it.com
Web: http://www.anexia-it.com
Headquarters address, Klagenfurt: Feldkirchnerstraße 140, 9020 Klagenfurt
Managing Director: Alexander Windbichler
Company register: FN 289918a | Place of jurisdiction: Klagenfurt | VAT ID: AT U63216601
BTW, I just did the same on the second head in this filer. It's configured as the passive node (so only 3 disks, raid4 for the root aggregate), and the same thing seems to be happening here now:
*> aggr status -r
Aggregate aggr0 (online, raid4) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (normal, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
parity 0d.11.22 0d 11 22 SA:B - BSAS 7200 847555/1735794176 847884/1736466816 (replacing, copy in progress)
-> copy 0d.11.9 0d 11 9 SA:B - BSAS 7200 847555/1735794176 847884/1736466816 (copy 3% completed)
data 0d.11.23 0d 11 23 SA:B - BSAS 7200 847555/1735794176 847884/1736466816
Wed Nov 11 09:41:36 CET [:disk.ioMediumError:warning]: Medium error on disk 0d.11.23: op 0x28:746efd00:0050 sector 1953430830 SCSI:medium error - Unrecovered read error - If the disk is in a RAID group, the subsystem will attempt to reconstruct unreadable data (3 11 0 d4) (1943) [NETAPP X302_WVULC01TSSM 4321] S/N [WD-WCAW31422720]
Wed Nov 11 09:41:36 CET [:disk.ioFailed:error]: I/O operation failed despite several retries.
For now, it's still running - but I'm just waiting for it to also break and since this is a raid4, the second failed disk will immediately cancel the process :-/
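To see at a glance which disks are throwing medium errors while a copy or reconstruct is running, the messages can be tallied per disk. This is only a sketch under the assumption that every disk.ioMediumError line has the same layout as the one quoted above. The function name is hypothetical and the regular expression was written against that single sample line, so other ONTAP releases may well need adjustments.

```python
import re
from collections import Counter

# The two console messages quoted above.
LOG = """\
Wed Nov 11 09:41:36 CET [:disk.ioMediumError:warning]: Medium error on disk 0d.11.23: op 0x28:746efd00:0050 sector 1953430830 SCSI:medium error - Unrecovered read error - If the disk is in a RAID group, the subsystem will attempt to reconstruct unreadable data (3 11 0 d4) (1943) [NETAPP X302_WVULC01TSSM 4321] S/N [WD-WCAW31422720]
Wed Nov 11 09:41:36 CET [:disk.ioFailed:error]: I/O operation failed despite several retries.
"""

# Assumption: the disk name follows "Medium error on disk" and the failing
# sector follows the word "sector", as in the sample line.
MEDIUM_ERROR = re.compile(
    r"\[[^\]]*disk\.ioMediumError[^\]]*\]: Medium error on disk "
    r"(?P<disk>[\w.]+): .*?sector (?P<sector>\d+)"
)


def medium_errors_by_disk(log_text):
    """Count disk.ioMediumError messages per disk name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MEDIUM_ERROR.search(line)
        if match:
            counts[match.group("disk")] += 1
    return counts


if __name__ == "__main__":
    for disk, count in medium_errors_by_disk(LOG).most_common():
        print(disk, count)
```

Run against a full messages file, the disks at the top of the list are the first candidates for replacement before the aggregate is grown any further.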
Best,
Alexander Griesser Head of Systems Operations
In reply to: Alexander Griesser (via toasters-bounces@teaparty.net), sent Wednesday, 11 November 2015 09:19, to toasters@teaparty.net, subject: How to verify the health of all disks?
Hi Alexander,
I'd suggest doing so.
Don't forget to issue a "disk zero spares" afterwards.
br
Josef
Best regards
Josef Kropf (Senior System Administrator) Lyoness Group AG
Any specific tests that I should run here? Or just all at once? Will that impact my current filesystem disks or should I run it on the spares first?
Best,
Alexander Griesser Head of Systems Operations
In reply to: Josef Kropf (via toasters-bounces@teaparty.net), sent Wednesday, 11 November 2015 09:57, to toasters@teaparty.net, subject: AW: How to verify the health of all disks?
You need to update drive firmware before continuing (probably in conjunction with DOT upgrade). See http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=853138.
---
With best regards
Andrei Borzenkov
Senior system engineer
FTS WEMEAI RUC RU SC TMS FOS
FUJITSU
Zemlyanoy Val Street, 9, 105 064 Moscow, Russian Federation
Tel.: +7 495 730 62 20 (reception)
Mob.: +7 916 678 7208
Fax: +7 495 730 62 14
E-mail: Andrei.Borzenkov@ts.fujitsu.com
Web: ru.fujitsu.com (http://ts.fujitsu.com/)
Company details: ts.fujitsu.com/imprint (http://ts.fujitsu.com/imprint.html)
Hi,
The disks that show 4321 as their revision won't update; all other disks were updated perfectly fine. I've upgraded the HA pair to 8.1.4P9 (which is the latest release supported on this platform, AFAIK). Any idea how I can force the disk firmware update on those disks? Maybe in maintenance mode?
Alexander Griesser Head of Systems Operations
In reply to: andrei.borzenkov@ts.fujitsu.com, sent Wednesday, 11 November 2015 11:41, to Alexander Griesser (AGriesser@anexia-it.com) and toasters@teaparty.net, subject: RE: How to verify the health of all disks?
Did you assign the disks to the storage system?
Issue a "disk show" just to be sure.
In reply to: Alexander Griesser (via toasters-bounces@teaparty.net), sent Wednesday, 11 November 2015 11:43, to andrei.borzenkov@ts.fujitsu.com and toasters@teaparty.net, subject: AW: How to verify the health of all disks?
Hi,
the disks which are showing 4321 as revision won't update; all other disks got updated perfectly fine. I've upgraded the HA pair to 8.1.4P9 (which is the latest supported release on this platform, AFAIK).
Any idea how I can force the disk firmware update on those disks? Maybe in maintenance mode?
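To keep track of which disks still report REV 4321 after an update attempt, the disk_list output can be filtered with a small script. This is only a rough sketch: it assumes the output has been saved as plain text in the whitespace-separated column layout shown in this thread, and the function name disks_with_rev is made up for illustration, not part of any NetApp tooling.

```python
import re

# Sample rows copied from the disk_list output quoted in this thread.
SAMPLE = """\
DISK CHAN VENDOR PRODUCT ID REV SERIAL# HW (BLOCKS BPS) DQ
------------ ----- -------- ---------------- ---- -------------------- -- -------------- --
0d.11.2 SA:A NETAPP X302_WVULC01TSSM 4321 WD-WCAW30821984 ff 1953525168 512 N
0d.11.0 SA:A NETAPP X302_WVULC01TSSM NA02 WD-WCAW31217461 ff 1953525168 512 N
"""

# Assumption: a data row starts with a disk name such as 0d.11.2.
DISK_NAME = re.compile(r"^\d+[a-z]\.\d+\.\d+$")


def disks_with_rev(text, rev="4321"):
    """Return (disk, serial) pairs whose REV column equals rev."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Data-row columns: DISK CHAN VENDOR PRODUCT-ID REV SERIAL# ...
        if len(fields) >= 6 and DISK_NAME.match(fields[0]):
            if fields[4] == rev:
                hits.append((fields[0], fields[5]))
    return hits


if __name__ == "__main__":
    for disk, serial in disks_with_rev(SAMPLE):
        print(disk, serial)
```

Header and separator lines are skipped because their first field does not look like a disk name, so the same function can be fed the complete captured output.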
From: andrei.borzenkov@ts.fujitsu.com [mailto:andrei.borzenkov@ts.fujitsu.com] Sent: Wednesday, November 11, 2015 11:41 To: Alexander Griesser AGriesser@anexia-it.com; toasters@teaparty.net Subject: RE: How to verify the health of all disks?
You need to update drive firmware before continuing (probably in conjunction with DOT upgrade). See http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=853138.
---
With best regards
Andrei Borzenkov
Senior system engineer
FTS WEMEAI RUC RU SC TMS FOS
FUJITSU
Zemlyanoy Val Street, 9, 105 064 Moscow, Russian Federation
Tel.: +7 495 730 62 20 ( reception)
Mob.: +7 916 678 7208
Fax: +7 495 730 62 14
E-mail: Andrei.Borzenkov@ts.fujitsu.com
Web: ru.fujitsu.com <http://ts.fujitsu.com/>
Company details: ts.fujitsu.com/imprint <http://ts.fujitsu.com/imprint.html>
This communication contains information that is confidential, proprietary in nature and/or privileged. It is for the exclusive use of the intended recipient(s). If you are not the intended recipient(s) or the person responsible for delivering it to the intended recipient(s), please note that any form of dissemination, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender and delete the original communication. Thank you for your cooperation.
Please be advised that neither Fujitsu, its affiliates, its employees or agents accept liability for any errors, omissions or damages caused by delays of receipt or by any virus infection in this message or its attachments, or which may otherwise arise as a result of this e-mail transmission.
Yes, they're properly assigned.
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Josef Kropf Sent: Wednesday, November 11, 2015 11:56 To: toasters@teaparty.net Subject: AW: How to verify the health of all disks?
Did you assign the disk to the storage?
Issue a "disk show" just to be sure...
I've found the tool called hammer also quite effective at shaking out bad disks. Run it for 3 days full bore.
On Nov 11, 2015 6:30 AM, "Alexander Griesser" AGriesser@anexia-it.com wrote:
Yes, they’re properly assigned.
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
Hammer stresses the whole box to 100%
http://rajeev.name/2008/09/15/ontap-73-hammer/
From: Douglas Siggins [mailto:siggins@gmail.com] Sent: Wednesday, November 11, 2015 13:27 To: Alexander Griesser Cc: toasters@teaparty.net; Josef Kropf Subject: Re: AW: How to verify the health of all disks?
I've found the tool called hammer also quite effective at shaking out bad disks. Run it for 3 days full bore.
Do I understand correctly that I have to add all disks to the aggregate in order to test them? Or should I create a new aggregate with a new volume and have hammer run against that volume?
From: Josef Kropf [mailto:josef.kropf@gmail.com] Sent: Wednesday, November 11, 2015 14:17 To: 'Douglas Siggins' siggins@gmail.com; Alexander Griesser AGriesser@anexia-it.com Cc: toasters@teaparty.net Subject: AW: AW: How to verify the health of all disks?
Hammer stresses the whole box to 100%
http://rajeev.name/2008/09/15/ontap-73-hammer/
Von: Douglas Siggins [mailto:siggins@gmail.com] Gesendet: Mittwoch, 11. November 2015 13:27 An: Alexander Griesser Cc: toasters@teaparty.netmailto:toasters@teaparty.net; Josef Kropf Betreff: Re: AW: How to verify the health of all disks?
I've found the tool called hammer also quite effective at shaking out bad disks. Run it for 3 days full bore. On Nov 11, 2015 6:30 AM, "Alexander Griesser" <AGriesser@anexia-it.commailto:AGriesser@anexia-it.com> wrote: Yes, they’re properly assigned.
Alexander Griesser Head of Systems Operations
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On behalf of Josef Kropf
Sent: Wednesday, 11 November 2015 11:56
To: toasters@teaparty.net
Subject: AW: How to verify the health of all disks?
Did you assign the disks to the storage system?
Issue a "disk show" just to be sure.
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On behalf of Alexander Griesser
Sent: Wednesday, 11 November 2015 11:43
To: andrei.borzenkov@ts.fujitsu.com; toasters@teaparty.net
Subject: AW: How to verify the health of all disks?
Hi,
the disks that show 4321 as their revision won’t update; all other disks were updated without problems. I’ve upgraded the HA pair to 8.1.4P9 (which is the latest release supported on this platform, AFAIK). Any idea how I can force the disk firmware update on those disks? Maybe in maintenance mode?
Alexander Griesser Head of Systems Operations
From: andrei.borzenkov@ts.fujitsu.com [mailto:andrei.borzenkov@ts.fujitsu.com]
Sent: Wednesday, 11 November 2015 11:41
To: Alexander Griesser <AGriesser@anexia-it.com>; toasters@teaparty.net
Subject: RE: How to verify the health of all disks?
You need to update drive firmware before continuing (probably in conjunction with DOT upgrade). See http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=853138.
--- With best regards
Andrei Borzenkov
Senior system engineer
FTS WEMEAI RUC RU SC TMS FOS
FUJITSU
Zemlyanoy Val Street, 9, 105 064 Moscow, Russian Federation
Tel.: +7 495 730 62 20 (reception)
Mob.: +7 916 678 7208
Fax: +7 495 730 62 14
E-mail: Andrei.Borzenkov@ts.fujitsu.com
Web: ru.fujitsu.com
Company details: ts.fujitsu.com/imprint
This communication contains information that is confidential, proprietary in nature and/or privileged. It is for the exclusive use of the intended recipient(s). If you are not the intended recipient(s) or the person responsible for delivering it to the intended recipient(s), please note that any form of dissemination, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender and delete the original communication. Thank you for your cooperation. Please be advised that neither Fujitsu, its affiliates, its employees or agents accept liability for any errors, omissions or damages caused by delays of receipt or by any virus infection in this message or its attachments, or which may otherwise arise as a result of this e-mail transmission.
Alright, this is evidently what you have to do. I’ve now assigned all spare disks to one head and created a new aggregate (it is easier to remove again afterwards, and the chance of the filer panicking again due to multi-disk failures on aggr0 is lower). I’ve filled the aggregate to 99.99% with a volume and am now running the following command:
*> hammer -f -1 /vol/hammervol/test.hammer 1048576
This is filling the volume rather quickly, and the CPU is at 99% almost all the time:
*> sysstat -x 1
   CPU   NFS  CIFS  HTTP Total   Net  kB/s  Disk  kB/s  Tape  kB/s Cache Cache    CP    CP  Disk OTHER   FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out  read write  read write   age   hit  time    ty  util                      in   out    in   out
   99%     0     0     0    85     0     1    12 69632     0     0   24s  100%  100%    :f   33%    85     0     0     0     0     0     0
   99%     0     0     0     0     1     0  1960 83144     0     0   24s  100%   82%    Ff   42%     0     0     0     0     0     0     0
   99%     0     0     0     8     0     0   140 65632     0     0   24s  100%  100%    :v   35%     8     0     0     0     0     0     0
   99%     0     0     0     0     0     0  3516 72696     0     0   24s  100%   79%    Ff   38%     0     0     0     0     0     0     0
I will keep that running for a while and monitor the logs to see whether any more disks start to fail.
Best,
Alexander Griesser Head of Systems Operations
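For anyone who wants to keep an eye on a long hammer run from a captured console log, here is a minimal Python sketch (not from the thread) that averages CPU, disk write throughput and disk utilisation over "sysstat -x 1" rows. The 23-column order is read off the output above; the names COLUMNS, parse_sysstat_row and summarize are made up for illustration, and other ONTAP releases may print different columns.

```python
from statistics import mean

# Column order as it appears in the `sysstat -x` output quoted in this thread.
# Assumption: other ONTAP releases may print a different set of columns.
COLUMNS = [
    "cpu", "nfs", "cifs", "http", "total", "net_in", "net_out",
    "disk_read", "disk_write", "tape_read", "tape_write",
    "cache_age", "cache_hit", "cp_time", "cp_type", "disk_util",
    "other", "fcp", "iscsi", "fcp_in", "fcp_out", "iscsi_in", "iscsi_out",
]

# The four sample rows from the message above, one per line.
SAMPLE = """\
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s
99% 0 0 0 85 0 1 12 69632 0 0 24s 100% 100% :f 33% 85 0 0 0 0 0 0
99% 0 0 0 0 1 0 1960 83144 0 0 24s 100% 82% Ff 42% 0 0 0 0 0 0 0
99% 0 0 0 8 0 0 140 65632 0 0 24s 100% 100% :v 35% 8 0 0 0 0 0 0
99% 0 0 0 0 0 0 3516 72696 0 0 24s 100% 79% Ff 38% 0 0 0 0 0 0 0
"""


def parse_sysstat_row(line):
    """Return a dict for a data row, or None for headers and other lines."""
    fields = line.split()
    # Data rows have exactly 23 fields and start with a percentage (CPU).
    if len(fields) != len(COLUMNS) or not fields[0].endswith("%"):
        return None
    return dict(zip(COLUMNS, fields))


def summarize(text):
    """Average CPU %, disk write kB/s and disk util % over all data rows."""
    rows = [r for r in map(parse_sysstat_row, text.splitlines()) if r]
    if not rows:
        return None
    return {
        "samples": len(rows),
        "avg_cpu_pct": mean(int(r["cpu"].rstrip("%")) for r in rows),
        "avg_disk_write_kbs": mean(int(r["disk_write"]) for r in rows),
        "avg_disk_util_pct": mean(int(r["disk_util"].rstrip("%")) for r in rows),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

This only summarizes load; whether a disk is actually failing still has to be read from the filer's own logs.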
No, I am not aware of any fix for the 4321 problem, sorry. As far as I know, the problem was actually in the dongle, not the drive itself, so it is more or less cosmetic (although it does prevent you from updating the firmware on these disks, which is bad).
--- With best regards
Andrei Borzenkov
Senior system engineer, FUJITSU
Alright, then I'll try to replace them all with other disks and send them back to the supplier as defective. But the main question remains: how can I force an intensive disk check on all disks in the system?
Alexander Griesser Head of Systems Operations
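To work out which disks would have to go back to the supplier, the "disk_list" rows from the original post can be filtered by their REV column. The following is a minimal Python sketch (not from the thread), assuming the console output has been saved as text with one disk per line; the names parse_disk_list and disks_with_rev are made up for illustration.

```python
SUSPECT_REV = "4321"  # revision string reported by the affected disks

# Rows as shown in the original post, combined into one listing.
SAMPLE = """\
DISK         CHAN  VENDOR   PRODUCT ID       REV  SERIAL#              HW (BLOCKS BPS) DQ
------------ ----- -------- ---------------- ---- -------------------- -- -------------- --
0d.11.2 SA:A NETAPP X302_WVULC01TSSM 4321 WD-WCAW30821984 ff 1953525168 512 N
0d.11.0 SA:A NETAPP X302_WVULC01TSSM NA02 WD-WCAW31217461 ff 1953525168 512 N
"""


def parse_disk_list(text):
    """Extract disk name, product ID, revision and serial from data rows."""
    disks = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 10 whitespace-separated fields and a disk name
        # of the form <adapter>.<shelf>.<bay>, e.g. 0d.11.2.
        if len(fields) != 10 or fields[0].count(".") != 2:
            continue
        disks.append({
            "disk": fields[0],
            "product_id": fields[3],
            "rev": fields[4],
            "serial": fields[5],
        })
    return disks


def disks_with_rev(text, rev=SUSPECT_REV):
    """Return the names of all disks whose REV column equals `rev`."""
    return [d["disk"] for d in parse_disk_list(text) if d["rev"] == rev]


if __name__ == "__main__":
    for name in disks_with_rev(SAMPLE):
        print(name)
```

The resulting names are the candidates for "disk replace start", one at a time, as in the original post.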
Von: andrei.borzenkov@ts.fujitsu.com [mailto:andrei.borzenkov@ts.fujitsu.com] Gesendet: Mittwoch, 11. November 2015 12:10 An: Alexander Griesser AGriesser@anexia-it.com; toasters@teaparty.net Betreff: RE: How to verify the health of all disks?
No, I am not aware of any fix for 4321 problem, sorry. As far as I know the problem was actually in dongle, not drive itself, so it is more or less cosmetic (of course it prevents you from updating firmware on these disks, which is bad).
--- With best regards
Andrei Borzenkov Senior system engineer FTS WEMEAI RUC RU SC TMS FOS [cid:image001.gif@01CBF835.B3FEDA90] FUJITSU Zemlyanoy Val Street, 9, 105 064 Moscow, Russian Federation Tel.: +7 495 730 62 20 ( reception) Mob.: +7 916 678 7208 Fax: +7 495 730 62 14 E-mail: Andrei.Borzenkov@ts.fujitsu.commailto:Andrei.Borzenkov@ts.fujitsu.com Web: ru.fujitsu.comhttp://ts.fujitsu.com/ Company details: ts.fujitsu.com/imprinthttp://ts.fujitsu.com/imprint.html This communication contains information that is confidential, proprietary in nature and/or privileged. It is for the exclusive use of the intended recipient(s). If you are not the intended recipient(s) or the person responsible for delivering it to the intended recipient(s), please note that any form of dissemination, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender and delete the original communication. Thank you for your cooperation. Please be advised that neither Fujitsu, its affiliates, its employees or agents accept liability for any errors, omissions or damages caused by delays of receipt or by any virus infection in this message or its attachments, or which may otherwise arise as a result of this e-mail transmission.
From: Alexander Griesser [mailto:AGriesser@anexia-it.com] Sent: Wednesday, November 11, 2015 1:43 PM To: Borzenkov, Andrei; toasters@teaparty.netmailto:toasters@teaparty.net Subject: AW: How to verify the health of all disks?
Hi,
the disks which are showing 4321 as revision won't update, all other disks got updated perfectly fine. I've upgraded the HA pair to 8.1.4P9 (which is the latest support on this platform, AFAIK). Any idea how I can force the disk firmware update on those disks? Maybe in maintenance mode?
Alexander Griesser Head of Systems Operations
ANEXIA Internetdienstleistungs GmbH
E-Mail: AGriesser@anexia-it.commailto:AGriesser@anexia-it.com Web: http://www.anexia-it.comhttp://www.anexia-it.com/
Anschrift Hauptsitz Klagenfurt: Feldkirchnerstraße 140, 9020 Klagenfurt Geschäftsführer: Alexander Windbichler Firmenbuch: FN 289918a | Gerichtsstand: Klagenfurt | UID-Nummer: AT U63216601
Von: andrei.borzenkov@ts.fujitsu.commailto:andrei.borzenkov@ts.fujitsu.com [mailto:andrei.borzenkov@ts.fujitsu.com] Gesendet: Mittwoch, 11. November 2015 11:41 An: Alexander Griesser <AGriesser@anexia-it.commailto:AGriesser@anexia-it.com>; toasters@teaparty.netmailto:toasters@teaparty.net Betreff: RE: How to verify the health of all disks?
You need to update drive firmware before continuing (probably in conjunction with DOT upgrade). See http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=853138.
--- With best regards
Andrei Borzenkov Senior system engineer FTS WEMEAI RUC RU SC TMS FOS [cid:image001.gif@01CBF835.B3FEDA90] FUJITSU Zemlyanoy Val Street, 9, 105 064 Moscow, Russian Federation Tel.: +7 495 730 62 20 ( reception) Mob.: +7 916 678 7208 Fax: +7 495 730 62 14 E-mail: Andrei.Borzenkov@ts.fujitsu.commailto:Andrei.Borzenkov@ts.fujitsu.com Web: ru.fujitsu.comhttp://ts.fujitsu.com/ Company details: ts.fujitsu.com/imprinthttp://ts.fujitsu.com/imprint.html This communication contains information that is confidential, proprietary in nature and/or privileged. It is for the exclusive use of the intended recipient(s). If you are not the intended recipient(s) or the person responsible for delivering it to the intended recipient(s), please note that any form of dissemination, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender and delete the original communication. Thank you for your cooperation. Please be advised that neither Fujitsu, its affiliates, its employees or agents accept liability for any errors, omissions or damages caused by delays of receipt or by any virus infection in this message or its attachments, or which may otherwise arise as a result of this e-mail transmission.