Hello Toasters.
I've been using a script to monitor disk failures on our filers. It's worked pretty well in the past, but I've hit a snag today...
Normally I can tell if a disk has failed in one of two ways:
1) Check the SNMP "disks.failed" value; if it's not equal to zero, we've got a problem.
2) Check, via SNMP, the total number of disks, then check the number of active disks and add to it the number of spare disks. If total disks isn't equal to active plus spare, then there is a failure. (A rough sketch of both checks follows below.)
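
For anyone who wants to reproduce this, here is a minimal sketch of the two checks. It assumes Net-SNMP's snmpget is on the PATH; the hostname, community string, and the OID names in caps are placeholders I've made up -- substitute the real disk-counter OIDs from your filer's MIB.

#!/usr/bin/env python
import subprocess

FILER = "filer1"          # placeholder hostname
COMMUNITY = "public"      # placeholder community string

def snmp_int(oid):
    # Run snmpget with -Ovq (value only, quick print) and return an integer.
    out = subprocess.check_output(
        ["snmpget", "-v1", "-c", COMMUNITY, "-Ovq", FILER, oid])
    return int(out.decode("ascii").strip())

failed = snmp_int("DISKS-FAILED-OID")     # placeholder OID
total  = snmp_int("DISKS-TOTAL-OID")      # placeholder OID
active = snmp_int("DISKS-ACTIVE-OID")     # placeholder OID
spare  = snmp_int("DISKS-SPARE-OID")      # placeholder OID

# Method 1: the filer reports a failed disk directly.
if failed != 0:
    print("ALERT: %d failed disk(s) reported" % failed)

# Method 2: the counts don't add up.
if total != active + spare:
    print("ALERT: total (%d) != active (%d) + spare (%d)"
          % (total, active, spare))
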
The second method is often needed for the failures that "slip through the cracks", and is usually sufficient. Today, however, a disk failed and was removed from the TOTAL disk count instead of being added as a failed disk, so on a filer with 84 disks I get the following output from my script:
The global message is: "The system's global status is normal. "
Total disk count is: 83
The Active Disk Count is: 82
The Spare Disk Count is: 1
The Failed Disk Count is: 0
I've omitted the non-relevant output from my script, which does simple snmpgets via the Net-SNMP tools. Anyway, three things went wrong here: the global message doesn't reflect a failure, the total disk count was decremented when it should not have been, and the failed disk count is still equal to zero.
Now, because I know that all our filers should always have 2 spare disks, I could rewrite my scripts to check for that. However, I had previously written them in a very portable way, and hardcoding the number of spares to look for will degrade that portability...
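
If I do end up going that route, the least-ugly version I've come up with is to keep the expected spare count in a small per-filer table instead of burying it in the check itself. Continuing the sketch above (same snmp_int helper, same FILER variable, same placeholder OID; the hostnames and counts here are made up):

EXPECTED_SPARES = {"filer1": 2, "filer2": 2}   # per-filer expected spares

spare = snmp_int("DISKS-SPARE-OID")            # placeholder OID
expected = EXPECTED_SPARES.get(FILER)
if expected is not None and spare < expected:
    print("ALERT: only %d spare disk(s) on %s, expected %d"
          % (spare, FILER, expected))

It still means maintaining a table by hand, which is exactly the kind of per-filer knowledge I was trying to keep out of the scripts.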
Any ideas why this happens? I'm sure I'm not the only one who's seen this.
Ben Rockwood
brockwood(a)homestead-inc.com