Keith,

We have currently set up to trap on 'miscGlobalStatus', as it seemed a reasonable starter. The MIB description is a little woolly, however:

"This indicates the overall status of the appliance. The algorithm to determine the value uses both hardware status (e.g. the number of failed fans) and volume status (e.g. number of volumes that are full). This may change...etc"

This has the values:

other(1)
unknown(2)
ok(3)
nonCritical(4)
critical(5)
nonRecoverable(6)

Is there a more detailed explanation of what event will trigger what status? We are considering configuring our notification system to respond according to the value, i.e. email/phone on (4), 24hr page on (5), print CV on (6).

On testing, when we switch off 1 power supply out of 2, we get a 'critical'(5) trap. (Is 'nonCritical' more appropriate?)

Other traps to possibly set up as a 'starter' set: OverTemperature, FailedFanCount, FailedPowerSupplyCount, dfPerCentKBytesCapacity.
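As a rough sketch of the value-based response described above, the mapping from 'miscGlobalStatus' to a notification could look something like the following in Python. The enum values and the three responses come from the message; the names `GlobalStatus`, `RESPONSES` and `response_for` are made up for illustration, and treating statuses 1 to 3 (and any out-of-range value) as "no action" is an assumption, not something the MIB specifies.

```python
from enum import IntEnum
from typing import Optional


class GlobalStatus(IntEnum):
    """miscGlobalStatus values as listed in the MIB description."""
    OTHER = 1
    UNKNOWN = 2
    OK = 3
    NON_CRITICAL = 4
    CRITICAL = 5
    NON_RECOVERABLE = 6


# Responses proposed in the message. Statuses 1-3 are deliberately absent:
# leaving them without an action is an assumption of this sketch.
RESPONSES = {
    GlobalStatus.NON_CRITICAL: "email/phone",
    GlobalStatus.CRITICAL: "24hr page",
    GlobalStatus.NON_RECOVERABLE: "print CV",
}


def response_for(trap_value: int) -> Optional[str]:
    """Return the proposed response for a trap value, or None for no action.

    Assumption: values outside 1..6 are treated as UNKNOWN rather than
    raising an error.
    """
    try:
        status = GlobalStatus(trap_value)
    except ValueError:
        status = GlobalStatus.UNKNOWN
    return RESPONSES.get(status)
```

With this layout, changing the response to a pulled power supply is a one-line edit to `RESPONSES`, whichever status the filer ends up reporting for it.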
Other monitoring issues:

We have configured syslog.conf to send error messages to 'messages.err', critical messages to 'messages.crit', etc., hoping to be able to parse these files and respond appropriately according to severity. However, when doing the same power supply pull, we only get a message in the 'info' file. (It is nevertheless deemed important enough to trigger an autosupport email.)

So it appears we have to match on the message wording itself: an incremental approach of matching on all known potential error messages, which is time consuming. In lieu of better classification of messages, has anybody got a list of known filer error messages, with severity level?
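To illustrate the "match on the message wording" approach, here is a minimal Python sketch that overrides the logged severity when a line matches a known pattern. The patterns below are hypothetical placeholders, not real filer message text, and the names `OVERRIDES` and `effective_severity` are invented for this example; the real list would have to be built up incrementally from observed messages.

```python
import re

# HYPOTHETICAL patterns: placeholders only, not actual filer message wording.
# Each entry pairs a regular expression with the severity we want to act on.
OVERRIDES = [
    (re.compile(r"power supply", re.IGNORECASE), "crit"),
    (re.compile(r"fan.*fail", re.IGNORECASE), "err"),
]


def effective_severity(line: str, logged_level: str) -> str:
    """Return the severity to act on for one log line.

    The first matching pattern wins; if nothing matches, the level the
    message was actually logged at is kept.
    """
    for pattern, severity in OVERRIDES:
        if pattern.search(line):
            return severity
    return logged_level
```

A line from the 'info' file that mentions a power supply would then be handled as 'crit', while unmatched lines keep the level they were logged at.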
How do other users monitor filers?
Regards,
Richard Moore
NortelNetworks
Harlow UK
Hi Richard...
Thanks for the "starter traps" suggestions.
> Is there a more detailed explanation of what event will trigger what status?
I'll defer to Brian's reply on this.
> On testing, when we switch off 1 power supply out of 2, we get a 'critical'(5) trap. (Is 'nonCritical' more appropriate?)
I guess it's somewhat subjective, but I'd call the loss of a power supply a fairly "critical" event, even when there is another power supply happily keeping your system up for you. Just because the failure doesn't cause downtime doesn't necessarily mean it should automatically fall into the "nonCritical" bracket, especially when you have a "nonRecoverable" bracket above "critical". To borrow a quote from Spinal Tap, the dial on this thing goes all the way to eleven! :-) That's my view anyway.
> We have configured syslog.conf to send message errors to 'messages.err', critical to 'messages.crit' etc. hoping to be able to parse these files to enable an appropriate response according to severity. However, when doing the same power supply pull, we only get a message in the 'info' file.
Yep, this is clearly inconsistent. It looks like Brian has already filed the bug. Sorry about that.
Keith