Alright, obviously, this is what you have to do.

I’ve now assigned all spare disks to one head and created a new aggregate. That makes it easier to remove again later, and it lowers the chance that the filer panics again due to multi-disk failures on aggr0. I’ve filled the aggregate with a volume to 99.99% and am now running the following command:

 

*> hammer -f -1 /vol/hammervol/test.hammer 1048576

 

This is filling the volume rather quickly; CPU is at 99% almost all the time:

 

*> sysstat -x 1

CPU    NFS   CIFS   HTTP   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP  CP  Disk   OTHER    FCP  iSCSI     FCP   kB/s   iSCSI   kB/s

                                       in    out    read  write    read  write    age    hit  time  ty  util                            in    out      in    out

99%      0      0      0      85       0      1      12  69632       0      0    24s   100%  100%  :f   33%      85      0      0       0      0       0      0

99%      0      0      0       0       1      0    1960  83144       0      0    24s   100%   82%  Ff   42%       0      0      0       0      0       0      0

99%      0      0      0       8       0      0     140  65632       0      0    24s   100%  100%  :v   35%       8      0      0       0      0       0      0

99%      0      0      0       0       0      0    3516  72696       0      0    24s   100%   79%  Ff   38%       0      0      0       0      0       0      0
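
For anyone who wants to watch these numbers over a longer run rather than eyeballing the console, here is a minimal Python sketch that summarises rows pasted from the output above. The helper names are hypothetical, and the column positions are an assumption taken from the header shown here (23 whitespace-separated fields per row); rows with a different layout are simply skipped.

```python
# Hypothetical helper: summarise data rows pasted from `sysstat -x 1`.
# Assumption: 23 whitespace-separated fields per row, in the order of the
# header above (CPU first, disk read/write kB/s at positions 8 and 9,
# disk utilisation at position 16).

SAMPLE = """\
99%      0      0      0      85       0      1      12  69632       0      0    24s   100%  100%  :f   33%      85      0      0       0      0       0      0
99%      0      0      0       0       1      0    1960  83144       0      0    24s   100%   82%  Ff   42%       0      0      0       0      0       0      0
99%      0      0      0       8       0      0     140  65632       0      0    24s   100%  100%  :v   35%       8      0      0       0      0       0      0
99%      0      0      0       0       0      0    3516  72696       0      0    24s   100%   79%  Ff   38%       0      0      0       0      0       0      0
"""


def parse_sysstat_rows(text):
    """Return one dict per data row; header and malformed lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 23 fields and start with a CPU percentage.
        if len(fields) != 23 or not fields[0].endswith("%"):
            continue
        rows.append({
            "cpu_pct": int(fields[0].rstrip("%")),
            "disk_read_kb": int(fields[7]),
            "disk_write_kb": int(fields[8]),
            "disk_util_pct": int(fields[15].rstrip("%")),
        })
    return rows


def summarise(rows):
    """Average CPU and disk write rate, plus peak disk utilisation."""
    n = len(rows)
    return {
        "samples": n,
        "avg_cpu_pct": sum(r["cpu_pct"] for r in rows) / n,
        "avg_disk_write_kb": sum(r["disk_write_kb"] for r in rows) / n,
        "max_disk_util_pct": max(r["disk_util_pct"] for r in rows),
    }


if __name__ == "__main__":
    print(summarise(parse_sysstat_rows(SAMPLE)))
```

On the four rows above this gives an average disk write rate of about 72776 kB/s and a peak disk utilisation of 42%.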

 

I’ll keep that running for a while and monitor the logs to see if any more disks start to fail.

 

Best,

 

Alexander Griesser

Head of Systems Operations

 

ANEXIA Internetdienstleistungs GmbH

 

E-Mail: AGriesser@anexia-it.com

Web: http://www.anexia-it.com

 

Headquarters address, Klagenfurt: Feldkirchnerstraße 140, 9020 Klagenfurt

Managing Director: Alexander Windbichler

Commercial register: FN 289918a | Place of jurisdiction: Klagenfurt | VAT ID: AT U63216601

 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Alexander Griesser
Sent: Wednesday, November 11, 2015 14:31
To: Josef Kropf <josef.kropf@gmail.com>; 'Douglas Siggins' <siggins@gmail.com>
Cc: toasters@teaparty.net
Subject: AW: AW: How to verify the health of all disks?

 

Do I understand correctly that I have to add all disks to the aggregate in order to test them? Or should I create a new aggregate with a new volume and run hammer against that volume?

 

Alexander Griesser


 

From: Josef Kropf [mailto:josef.kropf@gmail.com]
Sent: Wednesday, November 11, 2015 14:17
To: 'Douglas Siggins' <siggins@gmail.com>; Alexander Griesser <AGriesser@anexia-it.com>
Cc: toasters@teaparty.net
Subject: AW: AW: How to verify the health of all disks?

 

Hammer stresses the whole box to 100%

 

http://rajeev.name/2008/09/15/ontap-73-hammer/

 

 

 

From: Douglas Siggins [mailto:siggins@gmail.com]
Sent: Wednesday, November 11, 2015 13:27
To: Alexander Griesser
Cc: toasters@teaparty.net; Josef Kropf
Subject: Re: AW: How to verify the health of all disks?

 

I've found the tool called hammer also quite effective at shaking out bad disks. Run it for 3 days full bore.

On Nov 11, 2015 6:30 AM, "Alexander Griesser" <AGriesser@anexia-it.com> wrote:

Yes, they’re properly assigned.

 

Alexander Griesser


 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Josef Kropf
Sent: Wednesday, November 11, 2015 11:56
To: toasters@teaparty.net
Subject: AW: How to verify the health of all disks?

 

Did you assign the disk to the storage?

 

Issue a "disk show" just to be sure…

 

 

 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Alexander Griesser
Sent: Wednesday, November 11, 2015 11:43
To: andrei.borzenkov@ts.fujitsu.com; toasters@teaparty.net
Subject: AW: How to verify the health of all disks?

 

Hi,

 

The disks that show 4321 as their revision won’t update; all other disks updated perfectly fine. I’ve upgraded the HA pair to 8.1.4P9 (which is the latest release supported on this platform, AFAIK).

Any idea how I can force the disk firmware update on those disks? Maybe in maintenance mode?

 

Alexander Griesser


 

From: andrei.borzenkov@ts.fujitsu.com [mailto:andrei.borzenkov@ts.fujitsu.com]
Sent: Wednesday, November 11, 2015 11:41
To: Alexander Griesser <AGriesser@anexia-it.com>; toasters@teaparty.net
Subject: RE: How to verify the health of all disks?

 

You need to update drive firmware before continuing (probably in conjunction with DOT upgrade). See http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=853138.

 

---

With best regards

 

Andrei Borzenkov

Senior system engineer

FTS WEMEAI RUC RU SC TMS FOS


FUJITSU

Zemlyanoy Val Street, 9, 105 064 Moscow, Russian Federation

Tel.: +7 495 730 62 20 ( reception)

Mob.: +7 916 678 7208

Fax: +7 495 730 62 14

E-mail: Andrei.Borzenkov@ts.fujitsu.com

Web: ru.fujitsu.com

Company details: ts.fujitsu.com/imprint


 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Alexander Griesser
Sent: Wednesday, November 11, 2015 11:19 AM
To: toasters@teaparty.net
Subject: How to verify the health of all disks?

 

Hi Toasters,

 

I had a pretty stunning experience on a customer’s used filer installation yesterday. Long story short:

FAS2040 with 1x DS4243 (24x 1TB SATA disks). I had to reinitialize the whole system because the controller and the shelf were procured from different sources, so there was no root volume, etc.

After I figured out the disk reassignment and wiped both filers, they both booted up with a 3-disk raid_dp aggr0 and I could start working on them.

 

I then added a bunch of disks to filer #1 and continued with the configuration, until I found out that some of the disks would not accept the firmware update. The disks in question look like this:

 

*> disk_list

     DISK    CHAN  VENDOR   PRODUCT ID       REV  SERIAL#              HW (BLOCKS   BPS) DQ

------------ ----- -------- ---------------- ---- -------------------- -- -------------- --

0d.11.2       SA:A NETAPP   X302_WVULC01TSSM 4321 WD-WCAW30821984      ff 1953525168  512  N

 

Working disks look like this:

 

*> disk_list

     DISK    CHAN  VENDOR   PRODUCT ID       REV  SERIAL#              HW (BLOCKS   BPS) DQ

------------ ----- -------- ---------------- ---- -------------------- -- -------------- --

0d.11.0       SA:A NETAPP   X302_WVULC01TSSM NA02 WD-WCAW31217461      ff 1953525168  512  N
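
Telling the two kinds of disks apart across a full shelf can be scripted. The following is only a rough Python sketch with hypothetical helper names; it assumes every data row of the disk_list output has the ten whitespace-separated fields seen in the two examples above and treats REV "4321" as the marker for a suspect disk.

```python
# Hypothetical helper: flag disks whose firmware revision reads "4321".
# Assumption: each data row of `disk_list` has ten whitespace-separated
# fields (DISK, CHAN, VENDOR, PRODUCT ID, REV, SERIAL#, HW, BLOCKS, BPS, DQ).

SAMPLE = """\
     DISK    CHAN  VENDOR   PRODUCT ID       REV  SERIAL#              HW (BLOCKS   BPS) DQ
------------ ----- -------- ---------------- ---- -------------------- -- -------------- --
0d.11.2       SA:A NETAPP   X302_WVULC01TSSM 4321 WD-WCAW30821984      ff 1953525168  512  N
0d.11.0       SA:A NETAPP   X302_WVULC01TSSM NA02 WD-WCAW31217461      ff 1953525168  512  N
"""

SUSPECT_REV = "4321"


def parse_disk_list(text):
    """Return one dict per disk row; prompt, header and separator are skipped."""
    disks = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have ten fields with numeric BLOCKS and BPS columns.
        if len(fields) != 10:
            continue
        if not (fields[7].isdigit() and fields[8].isdigit()):
            continue
        disks.append({
            "disk": fields[0],
            "product": fields[3],
            "rev": fields[4],
            "serial": fields[5],
            "blocks": int(fields[7]),
            "bps": int(fields[8]),
        })
    return disks


def suspect_disks(disks, rev=SUSPECT_REV):
    """Names of all disks reporting the given firmware revision."""
    return [d["disk"] for d in disks if d["rev"] == rev]


if __name__ == "__main__":
    for name in suspect_disks(parse_disk_list(SAMPLE)):
        print("suspect firmware revision on", name)
```

The same parsed rows also give the raw capacity: 1953525168 blocks of 512 bytes is 1,000,204,886,016 bytes, which matches the nominal 1TB.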

 

I’ve googled a bit and found out that disks showing up with REV "4321" need to be replaced; there seems to have been a series of disks with this error in the past. So I pulled one of those disks out of the filer and replaced it with another one.

The system immediately started to reconstruct the now-missing filesystem disk onto the spare disk. Then the log started to fill up with block errors on other disks during the reconstruction; about 10 minutes later a double reconstruct was running, and about 30 minutes later the filer panicked due to a multi-disk failure. That’s where I ended up.

 

Since there was no data on the filer, I wiped it again and am now back up and running with 3 disks in aggr0. I’m currently replacing one of the "4321" disks with another one:

 

*> disk replace start 0d.11.2 0d.11.3

*> aggr status -r

Aggregate aggr0 (online, raid_dp) (block checksums)

  Plex /aggr0/plex0 (online, normal, active)

    RAID group /aggr0/plex0/rg0 (normal, block checksums)

 

      RAID Disk Device          HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)

      --------- ------          ------------- ---- ---- ---- ----- --------------    --------------

      dparity   0d.11.0         0d    11  0   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816

      parity    0d.11.1         0d    11  1   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816

      data      0d.11.2         0d    11  2   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816 (replacing, copy in progress)

      -> copy   0d.11.3         0d    11  3   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816 (copy 0% completed)
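
To keep an eye on the copy without rereading the whole listing each time, the progress figure can be pulled out of captured output. This is a small Python sketch with a hypothetical helper name; it assumes the "-> copy" line always ends in "(copy N% completed)" as it does in the output above.

```python
import re

# Hypothetical helper: extract disk-copy progress from `aggr status -r` text.
# Assumption: the copy target line looks like the one in the output above,
# i.e. it starts with "-> copy <device>" and ends with "(copy N% completed)".

SAMPLE = """\
      data      0d.11.2         0d    11  2   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816 (replacing, copy in progress)
      -> copy   0d.11.3         0d    11  3   SA:A   -  BSAS  7200 847555/1735794176 847884/1736466816 (copy 0% completed)
"""

COPY_RE = re.compile(
    r"^\s*->\s+copy\s+(\S+)\s.*\(copy\s+(\d+)%\s+completed\)\s*$"
)


def copy_progress(text):
    """Map each copy target device to its completed percentage."""
    progress = {}
    for line in text.splitlines():
        match = COPY_RE.match(line)
        if match:
            progress[match.group(1)] = int(match.group(2))
    return progress


if __name__ == "__main__":
    for device, pct in copy_progress(SAMPLE).items():
        print(device, "copy at", pct, "percent")
```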

 

The question now is: since the spare disks were all properly zeroed and there were no entries in the logs showing block errors on these disks, how can I verify that all my spare disks are really good? I’d love to run an intensive test on all the disks to make sure something like that doesn’t happen again when I put the filer into production.

 

I’m very thankful for any advice in this regard.

Is disk maint used for things like that?

 

*> disk maint list

Disk maint tests available

Test index: 0    Test Id: ws       Test name: Write Same Test

Test index: 1    Test Id: ndst     Test name: NDST Test

Test index: 2    Test Id: endst    Test name: Extended NDST Test

Test index: 3    Test Id: vt       Test name: Verify Test

Test index: 4    Test Id: ss       Test name: Start Stop Test

Test index: 5    Test Id: dt       Test name: Data Integrity Test

Test index: 6    Test Id: rdt      Test name: Read Test

Test index: 7    Test Id: pc       Test name: Power Cycle Test

 

Thanks,

 

Alexander Griesser


 


_______________________________________________
Toasters mailing list
Toasters@teaparty.net
http://www.teaparty.net/mailman/listinfo/toasters