bypassed disks - what's the deal? - toasters


Fletcher Cocquyt

24 Feb 2012

7:05 a.m.

We've discovered a couple of these bypassed disk conditions via the flashing amber light, but each one was noticed completely out of band from normal support channels.

Each time, we opened a case manually and NetApp immediately sent out a replacement disk. So why is a bypassed disk not treated as a failed disk? This kind of silent failure (in terms of NetApp monitoring and alerts) in a lights-out datacenter seems negligent.

Message logged on syslog server: esh.bypass.err.disk:error]: Disk 4d.49 on channels 4d/PARTNER disk shelf ID 3 ESH A bay 1 Bypassed due to the drive self bypass.

BTW: I read the KB article on bypassed disks and ran the command it gives to highlight BYP, but the output did not show BYP: https://kb.netapp.com/support/index?page=content&id=3012395

thanks,

Fletcher


Jacek

24 Feb 2012

9:20 a.m.


In my previous job I worked as a NAS admin managing about 100 filers. I had long discussions with NetApp, but it looks like they do not understand the problem:

Why is the disk bypassed?

Because it reached an error threshold and was proactively removed from the disk pool.

So it has actually failed and should be replaced. Why is it not marked as failed, and why does the filer status not reflect it?

Because the disk is not failed. It is bypassed.

...

And so on...


We maintained our own script that collected data from several commands so that we were aware of any type of disk problem. It always picked up bypassed disks, even when they were not marked as BYP.
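As an illustration of this kind of home-grown monitoring, here is a minimal Python sketch that scans syslog lines for the `esh.bypass.err.disk` event quoted at the top of the thread. It is not the script described in this message. The field layout is inferred from that single sample line, so the regular expression may need adjusting for other shelf types or Data ONTAP releases, and the names `BypassEvent` and `find_bypass_events` are made up for this example.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern inferred from the one sample message quoted in this thread.
# Assumption: other bypass messages follow the same field order.
BYPASS_RE = re.compile(
    r"esh\.bypass\.err\.disk:\w+\]:\s+"
    r"Disk\s+(?P<disk>\S+)\s+"
    r"on channels\s+(?P<channels>\S+)\s+"
    r"disk shelf ID\s+(?P<shelf>\d+)\s+"
    r"(?P<module>\S+\s+\S+)\s+"
    r"bay\s+(?P<bay>\d+)\s+"
    r"Bypassed due to\s+(?P<reason>.+?)\.?$"
)


class BypassEvent(NamedTuple):
    """One parsed bypass event (hypothetical helper type)."""
    disk: str
    channels: str
    shelf_id: int
    module: str
    bay: int
    reason: str


def find_bypass_events(lines: Iterable[str]) -> Iterator[BypassEvent]:
    """Yield a BypassEvent for every line that matches the bypass pattern."""
    for line in lines:
        match = BYPASS_RE.search(line.strip())
        if match:
            yield BypassEvent(
                disk=match["disk"],
                channels=match["channels"],
                shelf_id=int(match["shelf"]),
                module=match["module"],
                bay=int(match["bay"]),
                reason=match["reason"],
            )


if __name__ == "__main__":
    sample = [
        "esh.bypass.err.disk:error]: Disk 4d.49 on channels 4d/PARTNER "
        "disk shelf ID 3 ESH A bay 1 Bypassed due to the drive self bypass.",
    ]
    for event in find_bypass_events(sample):
        print(f"ALERT: disk {event.disk} bypassed "
              f"(shelf {event.shelf_id}, bay {event.bay}): {event.reason}")
```

In practice the input would be the syslog server's log file rather than a hard-coded sample, and each match would feed whatever alerting is already in place.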

We observed that the total number of disk problems decreased when we started to use Disk Maintenance Center; however, we sometimes had to start disk tests manually.

Best regards,

Jacek

Fletcher Cocquyt

9:59 a.m.

But this is totally unacceptable! Who else is putting up with this!?


tmac

10:42 a.m.

If the same slot continues to show BYP with new disks, it is likely a bad shelf.

--tmac Tim McCarthy Principal Consultant

RedHat Certified Engineer 804006984323821 (RHEL4) 805007643429572 (RHEL5)


Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters

Jeff Mohler

11:52 a.m.

I'm of the mind that a BYP needs more attention than just a disk swap... but it does need more attention from NGS, it would appear.


-- --- Gustatus Similis Pullus

jfs

9:33 p.m.

I bought my first NetApp in 1995 and I've never seen anything like this with one exception - a DS14 with a bad backplane. It did take a while to diagnose because nobody ever saw a backplane fail like that before.


Webster, Stetson

26 Feb 2012

5:22 p.m.

Please consider updating disk and shelf firmware. In my experience, this has solved 100% of BYP disk conditions, although I'm sure there are exceptions. It's highly likely that this problem has already been fixed in disk and/or shelf firmware. I'd pick that low-hanging fruit first because ESH firmware updates are usually non-disruptive.

Sent from my iThumbs. Please pardon my thumbmanship.

