Hello Toasters.
I've been using a script to monitor disk failures on our filers. It's worked pretty well in the past, but I've hit a snag today...
Normally I can tell if a disk has failed in one of two ways:
1) Check the SNMP "disks.failed" value; if !eq 0, we've got a problem.
2) Check via SNMP the total number of disks, then check the number of active disks and add to it the number of spare disks. If total disks !eq active plus spare disks, then there is a failure.
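The two checks above can be sketched roughly as follows. This is only an illustration, not the actual script: the function name check_disk_counts is made up, and it assumes the four counts have already been fetched (for example with snmpget) and converted to integers.

```python
from typing import List


def check_disk_counts(total: int, active: int, spare: int, failed: int) -> List[str]:
    """Return a list of problem descriptions; an empty list means both checks passed.

    Assumption: the four counts were already read over SNMP by the caller.
    """
    problems = []

    # Check 1: the failed-disk counter should be zero on a healthy filer.
    if failed != 0:
        problems.append(f"{failed} disk(s) reported as failed")

    # Check 2: every disk should be either active or spare.
    if total != active + spare:
        missing = total - (active + spare)
        problems.append(
            f"total ({total}) != active ({active}) + spare ({spare}); "
            f"{missing} disk(s) unaccounted for"
        )

    return problems


if __name__ == "__main__":
    for problem in check_disk_counts(total=84, active=82, spare=1, failed=1):
        print("WARNING:", problem)
```

Note that both checks rely on the filer still counting the failed disk somewhere; neither can fire if the disk vanishes from the total as well.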
The second method is often needed for the failures that "slip through the cracks", and is usually sufficient. However, today a disk failed and in doing so was removed from the TOTAL disk count without being added as a failed disk. So on a filer where I have 84 disks, I get the following output from my script:
The global message is: "The system's global status is normal. "
Total disk count is: 83
The Active Disk Count is: 82
The Spare Disk Count is: 1
The Failed Disk Count is: 0
I've omitted the irrelevant output from my script, which does simple snmpgets using the Net-SNMP tools. Anyway, three things went wrong here: the global message doesn't reflect a failure, the total disk count was decremented when it should not have been, and the failed disk count is still eq to zero. Now, because I know that all our filers should always have 2 spare disks, I can rewrite my scripts to look for this. However, I had previously written them in a very portable way, which hardcoding the number of spares to look for will degrade...
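One way to catch this case without hardcoding the spare count might be to compare each poll against a snapshot taken while the filer was known to be healthy. The sketch below is an assumption on my part, not something the existing script does: DiskCounts and compare_to_baseline are hypothetical names, and how the baseline is stored (a file per filer, say) is left to the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DiskCounts:
    """One poll of the four disk counters (hypothetical container)."""
    total: int
    active: int
    spare: int
    failed: int


def compare_to_baseline(baseline: DiskCounts, current: DiskCounts) -> List[str]:
    """Flag any counter that moved in a suspicious direction since the baseline."""
    warnings = []
    if current.total < baseline.total:
        warnings.append(
            f"total disk count dropped from {baseline.total} to {current.total}"
        )
    if current.spare < baseline.spare:
        warnings.append(
            f"spare disk count dropped from {baseline.spare} to {current.spare}"
        )
    if current.failed > baseline.failed:
        warnings.append(
            f"failed disk count rose from {baseline.failed} to {current.failed}"
        )
    return warnings


if __name__ == "__main__":
    # Baseline recorded while healthy; current values are from the output above.
    healthy = DiskCounts(total=84, active=82, spare=2, failed=0)
    today = DiskCounts(total=83, active=82, spare=1, failed=0)
    for warning in compare_to_baseline(healthy, today):
        print("WARNING:", warning)
```

With the numbers from today's output this reports both the drop in total disks and the drop in spares, even though the failed count is still zero. The trade-off is that the baseline has to be refreshed whenever disks are legitimately added or removed.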
Any ideas why this happens? I'm sure I'm not the only one who's seen this.
Ben Rockwood brockwood@homestead-inc.com
I've been using a script to monitor disk failures on our filers. It's worked pretty well in the past, but I've hit a snag today...
Normally I can tell if a disk has failed in one of two ways:
- Check the SNMP "disks.failed" value; if !eq 0, we've got a problem.
- Check via SNMP the total number of disks, then check the number of active disks and add to it the number of spare disks. If total disks !eq active plus spare disks, then there is a failure.
I'd just started doing this for some other storage products, and this inspired me to start doing it for the filers as well. Almost every aspect seems really well reported (got a decent Cricket SnapMirror config now too), but there's one omission I've noticed -- vifs.
Namely, I'd like to have a vifTable that would include the ifIndex of the vif itself and of each of its members, its type (single/multi), and its status (broken, etc). A set of quick SNMP scripts like the ones Ben described could then also verify proper operation (favored interface active if single, even distribution if multi, etc)....
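To make the idea concrete, here is a rough sketch of the checks such scripts might run. Everything in it is hypothetical: no vifTable exists to query, so the Vif record simply stands in for one row of the table I'm wishing for, the per-member octet counts are assumed to come from ordinary interface counters sampled over some interval, and the 50% skew threshold is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Vif:
    """Hypothetical stand-in for one row of the wished-for vifTable."""
    name: str
    kind: str                      # "single" or "multi"
    status: str                    # "up", "broken", ...
    members: Dict[str, int]        # member interface -> octets in the sample interval
    favored: Optional[str] = None  # single-mode only
    active: Optional[str] = None   # single-mode only


def check_vif(vif: Vif, max_skew: float = 0.5) -> List[str]:
    """Return a list of problems found for one vif (empty if it looks healthy)."""
    problems = []
    if vif.status != "up":
        problems.append(f"{vif.name}: status is {vif.status}")

    if vif.kind == "single":
        # The favored interface, if one is set, should be the active one.
        if vif.favored is not None and vif.active != vif.favored:
            problems.append(
                f"{vif.name}: active interface is {vif.active}, favored is {vif.favored}"
            )
    elif vif.kind == "multi":
        # Each member should carry roughly an equal share of the traffic.
        total = sum(vif.members.values())
        if vif.members and total > 0:
            fair_share = total / len(vif.members)
            for member, octets in vif.members.items():
                if abs(octets - fair_share) > max_skew * fair_share:
                    problems.append(
                        f"{vif.name}: {member} carried {octets} of {total} octets"
                    )
    return problems


if __name__ == "__main__":
    vif = Vif("vif0", "single", "up", {"e0": 0, "e1": 0}, favored="e0", active="e1")
    for problem in check_vif(vif):
        print("WARNING:", problem)
```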
Anyone know if this is planned, or whether it has been intentionally passed over?
..kg..