BTW, I just did the same on the second head in this filer; it's configured as a passive node (so only 3 disks, raid4 for the root aggregate) and the same thing seems to be happening here now:
*> aggr status -r
Aggregate aggr0 (online, raid4) (block checksums)
  Plex /aggr0/plex0 (online, normal, active)
    RAID group /aggr0/plex0/rg0 (normal, block checksums)

      RAID Disk Device    HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------    ------------- ---- ---- ---- ----- --------------    --------------
      parity    0d.11.22  0d    11  22  SA:B   -  BSAS  7200 847555/1735794176 847884/1736466816 (replacing, copy in progress)
      -> copy   0d.11.9   0d    11  9   SA:B   -  BSAS  7200 847555/1735794176 847884/1736466816 (copy 3% completed)
      data      0d.11.23  0d    11  23  SA:B   -  BSAS  7200 847555/1735794176 847884/1736466816
Wed Nov 11 09:41:36 CET [:disk.ioMediumError:warning]: Medium error on disk 0d.11.23: op 0x28:746efd00:0050 sector 1953430830 SCSI:medium error - Unrecovered read error - If the disk is in a RAID group, the subsystem will attempt to reconstruct unreadable data (3 11 0 d4) (1943) [NETAPP X302_WVULC01TSSM 4321] S/N [WD-WCAW31422720]
Wed Nov 11 09:41:36 CET [:disk.ioFailed:error]: I/O operation failed despite several retries.
For now it's still running, but I'm just waiting for it to break as well; since this is a raid4 aggregate, a second failed disk will immediately cancel the copy process :-/
Best,
Alexander Griesser Head of Systems Operations
ANEXIA Internetdienstleistungs GmbH
E-Mail: AGriesser@anexia-it.com Web: http://www.anexia-it.com
Address (headquarters Klagenfurt): Feldkirchnerstraße 140, 9020 Klagenfurt | Managing Director: Alexander Windbichler | Company Register: FN 289918a | Place of Jurisdiction: Klagenfurt | VAT ID: AT U63216601
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On behalf of Alexander Griesser
Sent: Wednesday, November 11, 2015 09:19
To: toasters@teaparty.net
Subject: How to verify the health of all disks?
Hi Toasters,
I had a pretty stunning experience on a customer's used filer installation yesterday. Long story short: a FAS2040 with 1x DS4243 shelf of 24x 1TB SATA disks, and I had to reinitialize the whole system because the controller and shelves had been procured from different sources, so there was no root volume, etc. After I figured out the disk reassignment and wiped both filers, they both booted up with a 3-disk raid_dp aggr0 and I could start working on them.
I then added a bunch of disks to filer #1 and continued with the configuration, until I found out that some of the disks would not allow their firmware to be updated. The disks in question look like this:
*> disk_list
DISK         CHAN  VENDOR   PRODUCT ID       REV  SERIAL#              HW (BLOCKS BPS)   DQ
------------ ----- -------- ---------------- ---- -------------------- -- -------------- --
0d.11.2      SA:A  NETAPP   X302_WVULC01TSSM 4321 WD-WCAW30821984      ff 1953525168 512 N
Working disks look like this:
*> disk_list
DISK         CHAN  VENDOR   PRODUCT ID       REV  SERIAL#              HW (BLOCKS BPS)   DQ
------------ ----- -------- ---------------- ---- -------------------- -- -------------- --
0d.11.0      SA:A  NETAPP   X302_WVULC01TSSM NA02 WD-WCAW31217461      ff 1953525168 512 N
I googled a bit and found out that disks showing up with REV "4321" need to be replaced; there seems to have been a series of disks with this defect in the past. So what I did was pull one of those disks out of the filer and replace it with another one. The system immediately started to reconstruct the now missing filesystem disk onto the spare disk, when the log began to fill up with block errors on other disks during the reconstruction. About 10 minutes later a double reconstruct was running, and about 30 minutes later the filer panicked due to a multi-disk failure, and that's where I ended up.
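In hindsight I'm assuming a RAID scrub pass before pulling any disk would have surfaced those latent media errors while parity was still intact. If I read the docs right, it would look roughly like this, though the exact syntax is from memory, so treat it as a guess:

*> aggr scrub start aggr0     (read every block in the RAID groups and verify/repair parity)
*> aggr scrub status -v       (check progress; any medium errors found should end up in /etc/messages)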
So, since there was no data on the filer, I wiped it again; I'm now back up and running with 3 disks in aggr0 and am currently replacing one of the "4321" disks with another one:
*> disk replace start 0d.11.2 0d.11.3
*> aggr status -r
Aggregate aggr0 (online, raid_dp) (block checksums)
  Plex /aggr0/plex0 (online, normal, active)
    RAID group /aggr0/plex0/rg0 (normal, block checksums)

      RAID Disk Device    HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------    ------------- ---- ---- ---- ----- --------------    --------------
      dparity   0d.11.0   0d    11  0   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816
      parity    0d.11.1   0d    11  1   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816
      data      0d.11.2   0d    11  2   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816 (replacing, copy in progress)
      -> copy   0d.11.3   0d    11  3   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816 (copy 0% completed)
The question now is: since the spare disks were all properly zeroed and there were no entries in the logs that would show me block errors on these disks, how can I verify that all my spare disks are really good? I'd love to run an intensive test on all the disks to make sure something like that doesn't happen again once I put the filer into production.
I'm very thankful for any advice in this regard. Is disk maint used for things like that? I've put my (untested) guess at an invocation below the list of available tests.
*> disk maint list
Disk maint tests available
Test index: 0  Test Id: ws     Test name: Write Same Test
Test index: 1  Test Id: ndst   Test name: NDST Test
Test index: 2  Test Id: endst  Test name: Extended NDST Test
Test index: 3  Test Id: vt     Test name: Verify Test
Test index: 4  Test Id: ss     Test name: Start Stop Test
Test index: 5  Test Id: dt     Test name: Data Integrity Test
Test index: 6  Test Id: rdt    Test name: Read Test
Test index: 7  Test Id: pc     Test name: Power Cycle Test
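If disk maint is indeed the right tool, my naive reading of the output above suggests something along these lines for exercising a few spares before going into production; the exact flags, disk names, and list separators are an assumption on my part, so please correct me:

*> disk maint start -t rdt,vt -d 0d.11.10,0d.11.11,0d.11.12   (read test + verify test on some spare disks; disk names here are made up)
*> disk maint status -v                                        (watch per-disk progress and results)
*> disk maint abort 0d.11.10                                   (stop the test if a disk is needed back as a spare)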
Thanks,
Alexander Griesser Head of Systems Operations
ANEXIA Internetdienstleistungs GmbH
E-Mail: AGriesser@anexia-it.com Web: http://www.anexia-it.com
Address (headquarters Klagenfurt): Feldkirchnerstraße 140, 9020 Klagenfurt | Managing Director: Alexander Windbichler | Company Register: FN 289918a | Place of Jurisdiction: Klagenfurt | VAT ID: AT U63216601