Keith,
We have set up a trap on 'miscGlobalStatus' for now, as it seemed a
reasonable starting point. The MIB description is a little woolly, however:
"This indicates the overall status of the appliance. The algorithm to
determine the value uses both hardware status (e.g. the number of failed
fans) and volume status (e.g. number of volumes that are full). This may
change...etc"
This has the values other(1), unknown(2), ok(3), nonCritical(4),
critical(5) and nonRecoverable(6).
Is there a more detailed explanation of which event triggers which status?
We are considering configuring our notification system to respond
according to the value, i.e. email/phone on (4), 24hr page on (5), print CV
on (6).
In testing, when we switch off one of the two power supplies, we get a
'critical'(5) trap. (Would 'nonCritical' be more appropriate?)
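As a rough sketch of the value-based response we have in mind (Python; the
status names come from the MIB enumeration quoted above, but the action
names and the action_for helper are made up for illustration and are not
part of any product):

```python
# Value-to-name table taken from the miscGlobalStatus enumeration in the MIB.
GLOBAL_STATUS = {
    1: "other",
    2: "unknown",
    3: "ok",
    4: "nonCritical",
    5: "critical",
    6: "nonRecoverable",
}

# Hypothetical action names: placeholders for whatever the notification
# system actually provides.
ACTIONS = {
    4: "email_and_phone",
    5: "page_24hr",
    6: "print_cv",
}


def action_for(status_value):
    """Return (status name, action) for a miscGlobalStatus value.

    Values with no configured action, including ok(3) and anything
    outside the enumeration, fall back to "log_only".
    """
    name = GLOBAL_STATUS.get(status_value, "unrecognised")
    action = ACTIONS.get(status_value, "log_only")
    return name, action
```

With this table, the power supply test above (value 5) would map to the
24hr page, which is why the critical/nonCritical distinction matters to us.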
Other traps we might set up as a 'starter' set: OverTemperature,
FailedFanCount, FailedPowerSupplyCount, dfPerCentKBytesCapacity.
Other monitoring issues:
We have configured syslog.conf to send error messages to 'messages.err',
critical messages to 'messages.crit' etc., hoping to parse these files and
trigger an appropriate response according to severity. However, when doing
the same power supply pull, we only get a message in the 'info' file. (It
is nevertheless deemed important enough to trigger an autosupport email.)
So it appears we have to match on the message wording itself, incrementally
building up patterns for all known potential error messages, which is
time consuming.
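A minimal sketch of that wording-based matching, in Python. The patterns
and severities below are invented placeholders, not actual filer message
text; the real table would have to be filled in as messages are
encountered:

```python
import re

# Hypothetical rule table: (pattern, severity we want to assign).
# These patterns are illustrative guesses, not real filer messages.
RULES = [
    (re.compile(r"power supply", re.IGNORECASE), "critical"),
    (re.compile(r"fan.*fail", re.IGNORECASE), "critical"),
    (re.compile(r"over ?temperature", re.IGNORECASE), "critical"),
]


def classify(line, default="info"):
    """Return the severity of the first rule matching the line.

    Lines that match no rule keep the default severity, so unknown
    messages stay at 'info' until a rule is added for them.
    """
    for pattern, severity in RULES:
        if pattern.search(line):
            return severity
    return default
```

The first matching rule wins, so more specific patterns would need to be
listed ahead of general ones.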
In lieu of better classification of messages, does anybody have a list of
known filer error messages, with severity levels?
How do other users monitor filers?
Regards,
Richard Moore
NortelNetworks
Harlow UK