Can anyone shed some light on the error below? It looks like it is just a media error which was recovered from, but I wanted verification. Thank you, Mike
Wed May 12 00:13:03 EDT [isp2100_timeout]: 8.45 (cmd opcode 0x28, retry count 0): command timeout, aborting request
Wed May 12 00:13:03 EDT [isp2100_timeout]: isp2100_reset_device: device 8.45 (1.45)
Hi Mike,
The above message sequence is a symptom of a SCSI command timeout. As part of timeout recovery, the device in question is reset and all pending commands are reissued.
Hope this helps.
On Mon, 17 May 1999, Radek Aster wrote:
Wed May 12 00:13:03 EDT [isp2100_timeout]: 8.45 (cmd opcode 0x28, retry count 0): command timeout, aborting request
Wed May 12 00:13:03 EDT [isp2100_timeout]: isp2100_reset_device: device 8.45 (1.45)
The above message sequence is a symptom of a SCSI command timeout. As part of timeout recovery, the device in question is reset and all pending commands are reissued.
How would you characterize the severity of the message, though? I'm inclined to take small numbers of recoverable errors as normal operation, but I work in a company with a very noisy monitoring team that escalates every single error to completion, often to me. As a result it pays to be able to state firmly whether they should disregard certain errors (assuming a certain threshold for frequency, etc.). The timeout/reset error in particular seems harmless but, with more than twenty netapps in use, occurs just often enough to be irritating.
ejt
"Ethan" == Ethan Torretta ethantor@corp.webtv.net writes:
Ethan> How would you characterize the severity of the message, though? I'm
Ethan> inclined to take small numbers of recoverable errors as normal
Ethan> operation, but I work in a company with a very noisy monitoring team
Ethan> that escalates every single error to completion, often to me. As a
Ethan> result it pays to be able to state firmly whether they should
Ethan> disregard certain errors (assuming a certain threshold for
Ethan> frequency, etc.). The timeout/reset error in particular seems
Ethan> harmless but, with more than twenty netapps in use, occurs just
Ethan> often enough to be irritating.
In general, an occasional timeout should not be something to lose sleep over. Some of the more common causes of timeouts include
(1) In FC configurations, a bit flip in an FC frame results in the frame being dropped (all frames are CRC protected, so a CRC mismatch will result in the corresponding frame being discarded) leading to a timeout.
(2) A disk drive is doing "deep" recovery on a marginally written sector, heroically attempting to recover the data by doing many retries.
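For a monitoring team that wants to apply a frequency threshold rather than escalate every occurrence, the timeout lines can be counted per device. The following is a minimal Python sketch, assuming the messages look exactly like the one quoted in this thread; the function names and the threshold value are illustrative and not part of any NetApp tool.

```python
import re
from collections import Counter

# Matches the timeout line quoted in this thread. Other releases may word
# the message differently, so treat this pattern as an assumption.
TIMEOUT_RE = re.compile(
    r"\[isp2100_timeout\]: (?P<device>\d+\.\d+) "
    r"\(cmd opcode (?P<opcode>0x[0-9a-fA-F]+), retry count (?P<retry>\d+)\): "
    r"command timeout"
)


def count_timeouts(lines):
    """Return a Counter mapping device address (e.g. '8.45') to timeout count."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


def devices_over_threshold(lines, threshold):
    """Return the sorted devices with more than `threshold` timeouts.

    `threshold` is a site-chosen number; nothing in this thread defines it.
    """
    counts = count_timeouts(lines)
    return sorted(dev for dev, n in counts.items() if n > threshold)
```

Feeding this one day's worth of the messages file and calling `devices_over_threshold(lines, threshold=3)` would list only the devices that timed out more than three times, so a single timeout followed by a reset does not generate an escalation.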
Hope this helps,
I'm inclined to take small numbers of recoverable errors as normal operation, but I work in a company with a very noisy monitoring team that escalates every single error to completion, often to me. As a result it pays to be able to state firmly whether they should disregard certain errors (assuming a certain threshold for frequency, etc.). The timeout/reset error in particular seems harmless but, with more than twenty netapps in use, occurs just often enough to be irritating.
Arguably, any message that should be disregarded shouldn't be logged in the first place; should we simply suppress (unless some option is turned on to display them - probably some option less crude than cranking the "syslog" level up so it logs stuff at "debug" level, which causes tons of crap to be logged) messages that are either never significant or that haven't occurred often enough to be significant?
So is NetApp now saying that it is disregarding the disk errors from NetAppCache machines? As of version 3.3, the NetAppCache code does not log syslog messages about such disk error events. Amazingly, the disk errors went away after the 3.2 to 3.3 code upgrade.
Guy - I know my quoting may have been taken out of context but I think it is still relevant.
Colin Johnston SA PSINET UK
So is NetApp now saying that it is disregarding the disk errors from NetAppCache machines? As of version 3.3, the NetAppCache code does not log syslog messages about such disk error events.
To which disk error events are you referring? The NetApp disk driver code, QLogic SCSI driver code, and QLogic FC-AL driver code in the NetCache 3.3 release do appear to have code that should log messages for at least *some* disk error events. If there are significant errors that aren't getting messages logged, that's a bug (and, if such a bug was introduced, said bug might also be in filer code, unless it was fixed in our code base after the NetCache 3.3 release forked off from the main code line).
See below for a disk error logged by the 3.2 code; 3.3 does not seem to log these errors. We were always getting such errors at least once a week and were told by our SE to ignore them as they were not significant. Once the 3.3 code was installed the errors vanished, and I am sure physical hardware errors do not vanish overnight. For example:
Tue Jul 21 12:33:21 GMT [isp_main]: Disk 9a.1(0x8721a0): READ sector 818536 recovered error (1 18, 80)
Obviously it is very hard to prove that 3.3 does not log disk errors, since all the NetAppCache machines we have are now running the 3.3 code rev.
Colin Johnston SA PSINET UK
It appears that the SCSI driver (the part of the code that handles the SCSI protocol, whether over Boring Old Parallel SCSI or over Fibre Channel) was changed not to log SCSI "recovered errors"; in the ONTAP code base, it appears that change was made between the 5.1[.x] and 5.2 releases (i.e., 5.1[.x] appear to log them, and 5.2 and later appear not to log them), and in the NetCache code base, it appears that change was made between 3.2[.x] and 3.3.
The errors not logged in 5.2-and-later ONTAP releases and 3.3-and-later NetCache releases are:
unit attention
recovered error
not ready, in ONTAP 5.3 and later.
Other errors are logged.
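The list above can be written down as a small filter. This is only a sketch of the behaviour described in this message for ONTAP, not the actual driver code: the sense key numbers are the standard SCSI values, while the function name and the `(major, minor)` version tuple are hypothetical.

```python
# Standard SCSI sense key values.
RECOVERED_ERROR = 0x1
NOT_READY = 0x2
UNIT_ATTENTION = 0x6


def is_logged(sense_key, ontap_version):
    """Hypothetical helper: would this ONTAP release log the given sense key?

    `ontap_version` is a (major, minor) tuple such as (5, 2). The rules
    mirror the list in this message and nothing more.
    """
    # Recovered error and unit attention stop being logged as of 5.2.
    if ontap_version >= (5, 2) and sense_key in (RECOVERED_ERROR, UNIT_ATTENTION):
        return False
    # Not ready stops being logged as of 5.3.
    if ontap_version >= (5, 3) and sense_key == NOT_READY:
        return False
    # Other errors are logged.
    return True
```

Under these assumptions, `is_logged(RECOVERED_ERROR, (5, 1))` is true and `is_logged(RECOVERED_ERROR, (5, 2))` is false, which matches the weekly messages disappearing after an upgrade rather than the drives healing themselves.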
I'm not a SCSI expert, so you should ask one (hi, Radek!) if you have more detailed questions about the rationale for this.
Hi Guy, does the 3.4 NetCache release add back the syslog functionality for recovered disk errors? I have not seen any error logs so far, and I cannot believe that a disk somehow manages to fix itself.
Colin Johnston SA PSINET UK
On Tue, 18 May 1999, Guy Harris wrote:
See below for a disk error logged by the 3.2 code; 3.3 does not seem to log these errors. We were always getting such errors at least once a week and were told by our SE to ignore them as they were not significant. Once the 3.3 code was installed the errors vanished, and I am sure physical hardware errors do not vanish overnight. For example:
Tue Jul 21 12:33:21 GMT [isp_main]: Disk 9a.1(0x8721a0): READ sector 818536 recovered error (1 18, 80)
It appears that the SCSI driver (the part of the code that handles the SCSI protocol, whether over Boring Old Parallel SCSI or over Fibre Channel) was changed not to log SCSI "recovered errors"; in the ONTAP code base, it appears that change was made between the 5.1[.x] and 5.2 releases (i.e., 5.1[.x] appear to log them, and 5.2 and later appear not to log them), and in the NetCache code base, it appears that change was made between 3.2[.x] and 3.3.
The errors not logged in 5.2-and-later ONTAP releases and 3.3-and-later NetCache releases are:
unit attention
recovered error
not ready, in ONTAP 5.3 and later.
Other errors are logged.
I'm not a SCSI expert, so you should ask one (hi, Radek!) if you have more detailed questions about the rationale for this.
I'm inclined to take small numbers of recoverable errors as normal operation, but I work in a company with a very noisy monitoring team that escalates every single error to completion, often to me. As a result it pays to be able to state firmly whether they should disregard certain errors (assuming a certain threshold for frequency, etc.). The timeout/reset error in particular seems harmless but, with more than twenty netapps in use, occurs just often enough to be irritating.
Arguably, any message that should be disregarded shouldn't be logged in the first place; should we simply suppress (unless some option is turned on to display them - probably some option less crude than cranking the "syslog" level up so it logs stuff at "debug" level, which causes tons of crap to be logged) messages that are either never significant or that haven't occurred often enough to be significant?
ARGHHHHH!!!!!! Any message is significant, the error history of a device/bus/system is VERY important. Drives that generate a soft (recoverable) error every so often are expected (we have one that's been doing it for years). Drives that have never had any sort of errors that start to produce errors are suspect. Changes in behavior indicate that something has changed and needs to be looked at.
Hiding minor errors leads to end users living in Cloud-Cuckoo land, "Everything is fine, nothing can go wrong<click>wrong<click>wrong."
Please don't hide this message too (as you suppressed the disk soft error message).
----- Stephen C. Woods; UCLA SEASnet; 2567 Boelter hall; LA CA 90095; (310)-825-8614 Finger for public key scw@cirrus.seas.ucla.edu,Internet mail:scw@SEAS.UCLA.EDU
Any message is significant,
A significant message shouldn't be disregarded.
the error history of a device/bus/system is VERY important. Drives that generate a soft (recoverable) error every so often are expected (we have one that's been doing it for years). Drives that have never had any sort of errors that start to produce errors are suspect.
Which means that logging recoverable errors in the messages file, by itself, may not be what's wanted; what's wanted may be
1) a log you *can* get at, if you want, showing the full history;
2) something that detects a significant increase in the number of recoverable errors, i.e. "[a change] in behavior", and starts yelling and screaming when that happens (heck, perhaps it should, if you have a spare, *fail the drive* once the recoverable error rate goes above a certain level).
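The two points above could be combined in one small structure: keep every recovered error per drive, and compare the most recent window against that drive's own past rate. What follows is a rough Python sketch under stated assumptions; the class name, the one-week window, the factor of 3, and the minimum of 3 recent errors are all illustrative choices, not anything ONTAP implements.

```python
from collections import defaultdict

WEEK = 7 * 24 * 3600  # seconds


class RecoveredErrorMonitor:
    """Keeps the full error history and flags a change in behavior."""

    def __init__(self, window=WEEK, factor=3.0, min_recent=3):
        self.window = window          # length of the "recent" period, in seconds
        self.factor = factor          # how far above baseline counts as a jump
        self.min_recent = min_recent  # ignore isolated errors below this count
        self.history = defaultdict(list)  # drive -> every timestamp ever seen

    def record(self, drive, timestamp):
        """Store one recovered error; nothing is ever discarded."""
        self.history[drive].append(timestamp)

    def is_suspect(self, drive, now):
        """True if the drive's recent error count is well above its baseline."""
        events = self.history[drive]
        cutoff = now - self.window
        recent = sum(1 for t in events if cutoff < t <= now)
        older = [t for t in events if t <= cutoff]
        if recent < self.min_recent:
            return False  # an occasional soft error is expected
        if not older:
            return True   # a drive with no past errors suddenly produces several
        # Baseline: average errors per window over the drive's earlier history.
        baseline_windows = max((cutoff - min(older)) / self.window, 1.0)
        baseline = len(older) / baseline_windows
        return recent > self.factor * baseline
```

With these settings, a drive that has produced one soft error a week for years stays quiet, while a drive that suddenly produces several in one week is flagged. A caller could then decide whether to yell or, if a spare is available, fail the drive.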
Sounds to me like you wish the admin. to know his equipment. What a clever idea. It took me a while to learn the system, and many calls to the support line where I was told that I should not worry about a single incident of this or that message. I committed this information to memory and can diagnose problems before I call the support line for confirmation. -gdg