Re: ISP_2100: cmd opcode 0x28, retry count 0: command timeout, aborting request error


raster＠netapp.com

17 May 1999

10:13 p.m.

...

Can anyone shed some light on the error below? It looks like it is just a media error which was recovered from, but I wanted verification. Thank you, Mike

Wed May 12 00:13:03 EDT [isp2100_timeout]: 8.45 (cmd opcode 0x28, retry count 0): command timeout, aborting request
Wed May 12 00:13:03 EDT [isp2100_timeout]: isp2100_reset_device: device 8.45 (1.45)

Hi Mike,

The above message sequence is a symptom of SCSI command timeout. As part of timeout recovery, the device in question is reset and all pending commands are reissued.
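For anyone who wants to count these timeouts per device, here is a minimal sketch of a parser for the message. The regular expression is derived only from the single sample line quoted in this thread, so other releases may format the message differently; the function name `parse_timeout` and the small opcode table are illustrative assumptions, not part of any NetApp tool. Opcode 0x28 is the standard SCSI READ(10) command.

```python
import re

# Pattern derived only from the one sample line quoted in this thread;
# treat it as an assumption about the message format.
TIMEOUT_RE = re.compile(
    r"\[isp2100_timeout\]: (?P<device>\d+\.\d+) "
    r"\(cmd opcode (?P<opcode>0x[0-9a-fA-F]+), retry count (?P<retry>\d+)\): "
    r"command timeout"
)

# A few well-known SCSI opcodes from the standard command set.
OPCODE_NAMES = {
    0x08: "READ(6)",
    0x0A: "WRITE(6)",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
}


def parse_timeout(line):
    """Return device, opcode, command name and retry count, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    opcode = int(m.group("opcode"), 16)
    return {
        "device": m.group("device"),
        "opcode": opcode,
        "command": OPCODE_NAMES.get(opcode, "unknown"),
        "retry": int(m.group("retry")),
    }


sample = (
    "Wed May 12 00:13:03 EDT [isp2100_timeout]: 8.45 "
    "(cmd opcode 0x28, retry count 0): command timeout, aborting request"
)
print(parse_timeout(sample))
```

The companion `isp2100_reset_device` line does not match the pattern and yields `None`, so each timeout is counted once.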

Hope this helps.

--
---------------------------------------------------------------------------
Radek Aster [ raster_at_netapp_dot_com ] | "I bike, therefore I am"
http://www.nowhere.net/~raster/          |  - Rene' DesBiketes
---------------------------------------------------------------------------


Ethan Torretta

18 May

11:21 p.m.

New subject: ISP_2100: cmd opcode 0x28, retry count 0: command timeout, aborting request error

On Mon, 17 May 1999, Radek Aster wrote:

...

...
Wed May 12 00:13:03 EDT [isp2100_timeout]: 8.45 (cmd opcode 0x28, retry count 0): command timeout, aborting request
Wed May 12 00:13:03 EDT [isp2100_timeout]: isp2100_reset_device: device 8.45 (1.45)

The above message sequence is a symptom of SCSI command timeout. As part of timeout recovery, the device in question is reset and all pending commands are reissued.

How would you characterize the severity of the message, though? I'm inclined to take small numbers of recoverable errors as normal operation, but I work in a company with a very noisy monitoring team that escalates every single error to completion, often to me. As a result it pays to be able to state firmly whether they should disregard certain errors (assuming a certain threshold for frequency, etc.). The timeout/reset error in particular seems harmless but, with more than twenty netapps in use, occurs just often enough to be irritating.

ejt

raster＠netapp.com

6:19 p.m.

New subject: ISP_2100: cmd opcode 0x28, retry count 0: command timeout, aborting request error

...

...
...
...
...
"Ethan" == Ethan Torretta <ethantor@corp.webtv.net> writes:

Ethan> How would you characterize the severity of the message, though? I'm
Ethan> inclined to take small numbers of recoverable errors as normal
Ethan> operation, but I work in a company with a very noisy monitoring team
Ethan> that escalates every single error to completion, often to me. As a
Ethan> result it pays to be able to state firmly whether they should
Ethan> disregard certain errors (assuming a certain threshold for
Ethan> frequency, etc.). The timeout/reset error in particular seems
Ethan> harmless but, with more than twenty netapps in use, occurs just
Ethan> often enough to be irritating.

In general, an occasional timeout should not be something to lose sleep over. Some of the more common causes of timeouts include

(1) In FC configurations, a bit flip in an FC frame results in the frame being dropped (all frames are CRC protected, so a CRC mismatch will result in the corresponding frame being discarded), leading to a timeout.

(2) A disk drive is doing "deep" recovery on a marginally written sector, heroically attempting to recover the data by doing many retries.
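To illustrate cause (1), the sketch below shows how a CRC check turns a single flipped bit into a silently dropped frame. It uses Python's `zlib.crc32` as a stand-in checksum and a made-up frame layout (payload followed by a 4-byte trailer); real Fibre Channel framing has delimiters, headers and its own bit ordering, none of which is modelled here. The helper names `make_frame` and `receive_frame` are hypothetical.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload (toy frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive_frame(frame: bytes):
    """Return the payload if the CRC matches, else None (frame discarded)."""
    payload, trailer = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None
    return payload


frame = make_frame(b"data block for a READ(10)")
intact = receive_frame(frame)

corrupted = bytearray(frame)
corrupted[3] ^= 0x01  # flip a single bit "in transit"
dropped = receive_frame(bytes(corrupted))

print(intact is not None, dropped is None)
```

The receiver never reports the discarded frame to the sender, so the initiator only notices when the command fails to complete in time, which is the timeout seen in the log.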

Hope this helps,

guy＠netapp.com

7:16 p.m.

New subject: ISP_2100: cmd opcode 0x28, retry count 0: command timeout, aborting request error

...

I'm inclined to take small numbers of recoverable errors as normal operation, but I work in a company with a very noisy monitoring team that escalates every single error to completion, often to me. As a result it pays to be able to state firmly whether they should disregard certain errors (assuming a certain threshold for frequency, etc.). The timeout/reset error in particular seems harmless but, with more than twenty netapps in use, occurs just often enough to be irritating.

Arguably, any message that should be disregarded shouldn't be logged in the first place; should we simply suppress (unless some option is turned on to display them - probably some option less crude than cranking the "syslog" level up so it logs stuff at "debug" level, which causes tons of crap to be logged) messages that are either never significant or that haven't occurred often enough to be significant?

Colin Johnston

8:38 p.m.

New subject: syslog logging

...

Arguably, any message that should be disregarded shouldn't be logged in the first place; should we simply suppress (unless some option is turned on to display them - probably some option less crude than cranking the "syslog" level up so it logs stuff at "debug" level, which causes tons of crap to be logged) messages that are either never significant or that haven't occurred often enough to be significant?

So is NetApp now saying that they are disregarding the disk errors from NetAppCache machines, since syslog as of version 3.3 NetAppCache code does not log messages about such disk error events? Amazing: disk errors went away after the 3.2 to 3.3 code upgrade.

Guy - I know my quoting may have been taken out of context but I think it is still relevant.

Colin Johnston SA PSINET UK

guy＠netapp.com

9:10 p.m.

New subject: syslog logging

...

...
Arguably, any message that should be disregarded shouldn't be logged in the first place; should we simply suppress (unless some option is turned on to display them - probably some option less crude than cranking the "syslog" level up so it logs stuff at "debug" level, which causes tons of crap to be logged) messages that are either never significant or that haven't occurred often enough to be significant?

So is NetApp now saying that they are disregarding the disk errors from NetAppCache machines, since syslog as of version 3.3 NetAppCache code does not log messages about such disk error events?

To which disk error events are you referring? The NetApp disk driver code, QLogic SCSI driver code, and QLogic FC-AL driver code in the NetCache 3.3 release do appear to have code that should log messages for at least *some* disk error events. If there are significant errors that aren't getting messages logged, that's a bug (and, if such a bug was introduced, said bug might also be in filer code, unless it was fixed in our code base after the NetCache 3.3 release forked off from the main code line).

Colin Johnston

9:25 p.m.

New subject: syslog logging

...

To which disk error events are you referring? The NetApp disk driver code, QLogic SCSI driver code, and QLogic FC-AL driver code in the NetCache 3.3 release do appear to have code that should log messages for at least *some* disk error events. If there are significant errors that aren't getting messages logged, that's a bug (and, if such a bug was introduced, said bug might also be in filer code, unless it was fixed in our code base after the NetCache 3.3 release forked off from the main code line).

See below for 3.2 code disk errors; 3.3 does not seem to log these errors. We were always getting errors like the one below at least once a week and were told by our SE to ignore them as they were not significant. Once 3.3 code was installed the errors vanished, and I am sure physical hardware errors do not vanish overnight.

Tue Jul 21 12:33:21 GMT [isp_main]: Disk 9a.1(0x8721a0): READ sector 818536 recovered error (1 18, 80)

Obviously it is very hard to prove 3.3 does not log disk errors since all NetAppCache machines we have are now running 3.3 code rev.

Colin Johnston SA PSINET UK

guy＠netapp.com

9:38 p.m.

New subject: syslog logging

...

See below for 3.2 code disk errors; 3.3 does not seem to log these errors. We were always getting errors like the one below at least once a week and were told by our SE to ignore them as they were not significant. Once 3.3 code was installed the errors vanished, and I am sure physical hardware errors do not vanish overnight.

Tue Jul 21 12:33:21 GMT [isp_main]: Disk 9a.1(0x8721a0): READ sector 818536 recovered error (1 18, 80)

It appears that the SCSI driver (the part of the code that handles the SCSI protocol, whether over Boring Old Parallel SCSI or over Fibre Channel) was changed not to log SCSI "recovered errors"; in the ONTAP code base, it appears that change was made between the 5.1[.x] and 5.2 releases (i.e., 5.1[.x] appear to log them, and 5.2 and later appear not to log them), and in the NetCache code base, it appears that change was made between 3.2[.x] and 3.3.

The errors not logged in 5.2-and-later ONTAP releases and 3.3-and-later NetCache releases are:

unit attention

recovered error

not ready, in ONTAP 5.3 and later.

Other errors are logged.
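The list above amounts to a filter on the SCSI sense key. The sketch below models that filter; it is not the driver's code. The sense key numbers are the standard SCSI values, while the names `QUIET_BASE`, `QUIET_ONTAP_53` and `should_log` are hypothetical and exist only for illustration.

```python
# Standard SCSI sense key values (error classes reported by a device).
SENSE_KEYS = {
    0x1: "recovered error",
    0x2: "not ready",
    0x3: "medium error",
    0x4: "hardware error",
    0x5: "illegal request",
    0x6: "unit attention",
}

# Hypothetical suppression sets modelled on the list in this message.
QUIET_BASE = {0x1, 0x6}              # recovered error, unit attention
QUIET_ONTAP_53 = QUIET_BASE | {0x2}  # additionally: not ready


def should_log(sense_key, quiet=QUIET_BASE):
    """Return True if a message for this sense key would be logged."""
    return sense_key not in quiet


for key, name in sorted(SENSE_KEYS.items()):
    print(f"{name}: base={should_log(key)} "
          f"5.3+={should_log(key, QUIET_ONTAP_53)}")
```

Under this model a recovered error (sense key 1), such as the READ message quoted above, produces no log line, while a medium error (sense key 3) still does.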

I'm not a SCSI expert, so you should ask one (hi, Radek!) if you have more detailed questions about the rationale for this.

Colin Johnston

1 Aug

10:49 p.m.

New subject: syslog logging

Hi Guy, does the 3.4 NetCache release add back the syslog functionality for recovered disk errors? I have not seen any error logs so far and I cannot believe that somehow a disk manages to fix itself.

Colin Johnston SA PSINET UK

On Tue, 18 May 1999, Guy Harris wrote:

...

...
See below for 3.2 code disk errors; 3.3 does not seem to log these errors. We were always getting errors like the one below at least once a week and were told by our SE to ignore them as they were not significant. Once 3.3 code was installed the errors vanished, and I am sure physical hardware errors do not vanish overnight.

Tue Jul 21 12:33:21 GMT [isp_main]: Disk 9a.1(0x8721a0): READ sector 818536 recovered error (1 18, 80)

It appears that the SCSI driver (the part of the code that handles the SCSI protocol, whether over Boring Old Parallel SCSI or over Fibre Channel) was changed not to log SCSI "recovered errors"; in the ONTAP code base, it appears that change was made between the 5.1[.x] and 5.2 releases (i.e., 5.1[.x] appear to log them, and 5.2 and later appear not to log them), and in the NetCache code base, it appears that change was made between 3.2[.x] and 3.3.

The errors not logged in 5.2-and-later ONTAP releases and 3.3-and-later NetCache releases are:

unit attention

recovered error

not ready, in ONTAP 5.3 and later.

Other errors are logged.

I'm not a SCSI expert, so you should ask one (hi, Radek!) if you have more detailed questions about the rationale for this.

scw＠seas.ucla.edu

18 May

9:08 p.m.

New subject: ISP_2100: cmd opcode 0x28, retry count 0: command timeout, aborting request error

...

...
I'm inclined to take small numbers of recoverable errors as normal operation, but I work in a company with a very noisy monitoring team that escalates every single error to completion, often to me. As a result it pays to be able to state firmly whether they should disregard certain errors (assuming a certain threshold for frequency, etc.). The timeout/reset error in particular seems harmless but, with more than twenty netapps in use, occurs just often enough to be irritating.

Arguably, any message that should be disregarded shouldn't be logged in the first place; should we simply suppress (unless some option is turned on to display them - probably some option less crude than cranking the "syslog" level up so it logs stuff at "debug" level, which causes tons of crap to be logged) messages that are either never significant or that haven't occurred often enough to be significant?

ARGHHHHH!!!!!! Any message is significant; the error history of a device/bus/system is VERY important. Drives that generate a soft (recoverable) error every so often are expected (we have one that's been doing it for years). Drives that have never had any sort of errors that start to produce errors are suspect. Changes in behavior indicate that something has changed and needs to be looked at.

Hiding minor errors leads to end users living in Cloud-Cuckoo land, "Everything is fine, nothing can go wrong<click>wrong<click>wrong."

Please don't hide this message also (as you suppressed the Disk soft error message).

-----
Stephen C. Woods; UCLA SEASnet; 2567 Boelter hall; LA CA 90095; (310)-825-8614
Finger for public key scw@cirrus.seas.ucla.edu, Internet mail: scw@SEAS.UCLA.EDU

guy＠netapp.com

11:55 p.m.

New subject: ISP_2100: cmd opcode 0x28, retry count 0: command timeout, aborting request error

...

Any message is significant,

A significant message shouldn't be disregarded.

...

the error history of a device/bus/system is VERY important. Drives that generate a soft (recoverable) error every so often are expected (we have one that's been doing it for years). Drives that have never had any sort of errors that start to produce errors are suspect.

Which means that logging recoverable errors in the messages file, by itself, may not be what's wanted; what's wanted may be

1) a log you *can* get at, if you want, showing the full history;

2) something that detects a significant increase in the number of recoverable errors, i.e. "[a change] in behavior", and starts yelling and screaming when that happens (heck, perhaps it should, if you have a spare, *fail the drive* once the recoverable error rate goes above a certain level).
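The two points above can be sketched together: keep every recovered error in a per-disk history, and separately flag a disk when its recent error count exceeds a limit. This is only a minimal illustration of the idea; the class name `RecoveredErrorMonitor`, the one-week window and the limit of three errors are made-up assumptions, not values from any NetApp release.

```python
from collections import defaultdict, deque


class RecoveredErrorMonitor:
    """Keep full per-disk history and flag a jump in the recent error count."""

    def __init__(self, window_secs=7 * 24 * 3600, max_in_window=3):
        # Both defaults are arbitrary illustrative thresholds.
        self.window_secs = window_secs
        self.max_in_window = max_in_window
        self.history = defaultdict(list)   # disk -> every timestamp seen
        self.recent = defaultdict(deque)   # disk -> timestamps inside window

    def record(self, disk, timestamp):
        """Record one recovered error; return True if the disk is suspect."""
        self.history[disk].append(timestamp)
        recent = self.recent[disk]
        recent.append(timestamp)
        # Drop entries that have aged out of the sliding window.
        while recent and recent[0] <= timestamp - self.window_secs:
            recent.popleft()
        return len(recent) > self.max_in_window


# Usage: isolated errors stay quiet, a burst within one hour raises a flag.
mon = RecoveredErrorMonitor(window_secs=3600, max_in_window=2)
alerts = [mon.record("9a.1", t) for t in (0, 5000, 10000, 10100, 10200)]
print(alerts)
```

A drive that produces an occasional soft error for years never trips the limit, yet its full history remains available in `history`; what to do on a flag (yell, or fail the drive to a spare) is left to the caller.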

G D Geen

19 May

3:06 p.m.

New subject: ISP_2100: cmd opcode 0x28, retry count 0: command timeout, aborting request error

Sounds to me like you wish the admin. to know his equipment. What a clever idea. It took me a while to learn the system, and many calls to the support line where I was told that I should not worry about a single incident of this or that message. I dedicated this information to memory and can diagnose problems before I call the support line for confirmation. -gdg

scw@seas.ucla.edu wrote:

...

...
...
I'm inclined to take small numbers of recoverable errors as normal operation, but I work in a company with a very noisy monitoring team that escalates every single error to completion, often to me. As a result it pays to be able to state firmly whether they should disregard certain errors (assuming a certain threshold for frequency, etc.). The timeout/reset error in particular seems harmless but, with more than twenty netapps in use, occurs just often enough to be irritating.

Arguably, any message that should be disregarded shouldn't be logged in the first place; should we simply suppress (unless some option is turned on to display them - probably some option less crude than cranking the "syslog" level up so it logs stuff at "debug" level, which causes tons of crap to be logged) messages that are either never significant or that haven't occurred often enough to be significant?

ARGHHHHH!!!!!! Any message is significant; the error history of a device/bus/system is VERY important. Drives that generate a soft (recoverable) error every so often are expected (we have one that's been doing it for years). Drives that have never had any sort of errors that start to produce errors are suspect. Changes in behavior indicate that something has changed and needs to be looked at.

Hiding minor errors leads to end users living in Cloud-Cuckoo land, "Everything is fine, nothing can go wrong<click>wrong<click>wrong."

Please don't hide this message also (as you suppressed the Disk soft error message).

Stephen C. Woods; UCLA SEASnet; 2567 Boelter hall; LA CA 90095; (310)-825-8614
Finger for public key scw@cirrus.seas.ucla.edu, Internet mail: scw@SEAS.UCLA.EDU

--
---------------------------------------------------------------
G D Geen                              mailto:geen@ti.com
Texas Instruments                     Phone : (972)480.7896
System Administrator                  FAX   : (972)480.7676
---------------------------------------------------------------
Life is what happens while you're busy making other plans.
                                                  -J. Lennon


toasters@lists.teaparty.net
