This could be related to a similar situation we saw recently with a
failed disk. We don't use SNMP; we rely on the filer's email notification
facility instead. Anyway, a disk failed and was subsequently removed by
the OS, and no email was generated. So essentially the system didn't know
that the disk ever existed - similar to what you describe. I placed a call
to NetApp and they said this (the OS automatically removing a failed disk)
shouldn't ever happen and that it is a bug that will be fixed in a later
release.
Check your system log for a disk failure message followed immediately
by a removal message for the failed disk. If it's there then you've
probably got the bug.
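
If you'd rather not eyeball the log, something like the following could do
the scan. This is only a rough sketch: the two regular expressions are
guesses, not the actual wording the filer uses, so adjust them to whatever
your log really says. The function name is made up for illustration.

```python
import re

# Hypothetical patterns: the real log wording may differ, so adjust these.
FAILED = re.compile(r"disk (\S+).*fail", re.IGNORECASE)
REMOVED = re.compile(r"disk (\S+).*remov", re.IGNORECASE)

def failed_then_removed(lines):
    """Return IDs of disks whose failure line is immediately followed
    by a removal line naming the same disk."""
    hits = []
    # Walk the log as (line, next line) pairs.
    for current, following in zip(lines, lines[1:]):
        failed = FAILED.search(current)
        removed = REMOVED.search(following)
        if failed and removed and failed.group(1) == removed.group(1):
            hits.append(failed.group(1))
    return hits
```

Pass it the system log as a list of lines; a non-empty result would
suggest the failure-then-removal sequence described above.
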
Regards,
_/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/
_/ Don Cunningham - UNIX SysAdmin _/ Progress Software Corp. _/
_/ don.cunningham(a)progress.com _/ 14 Oak Park _/
_/ (781)280-4252 _/ Bedford, MA 01730-1485 _/
_/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/
_/ "Wherever beer is brewed, all is well - whenever beer is _/
_/ drunk, life is good." -Czech proverb _/
_/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/
>
> Hello Toasters.
>
> I've been using a script to monitor disk failures on our filers. It's
> worked pretty well in the past, but I've hit a snag today...
>
> Normally I can tell if a disk has failed in one of two ways:
> 1) Check the SNMP "disks.failed" value, if !eq 1, we've got a problem.
> 2) Check via SNMP, the total number of disks, then check the number of
> active disks, and add to it the number of spare disks. If total disks !eq
> active plus spare disks, then there is a failure.
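
For reference, the two checks boil down to roughly the following. This is
a sketch only: it assumes the counts have already been fetched (by snmpget
or otherwise), and it assumes a healthy filer reports a failed count of 0,
as in the script output quoted further down. The function name is made up.

```python
def disk_failure_suspected(total, active, spare, failed):
    """Return the reasons a failure is suspected (empty list if none)."""
    reasons = []
    # Check 1: the filer itself reports failed disks
    # (assumption: 0 means healthy).
    if failed != 0:
        reasons.append(f"{failed} disk(s) reported failed")
    # Check 2: every disk should be either active or spare.
    if total != active + spare:
        reasons.append(f"total {total} != active {active} + spare {spare}")
    return reasons
```

An empty list means both checks passed; anything else is worth an alert.
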
>
> The second method is often needed for the failures that "slip through the
> cracks", and is sufficient. However, today a disk failed and in doing so was
> removed from the TOTAL disk count, and not added as a failed disk, so in a
> case where I have 84 disks, I get the following output from my script:
>
> The global message is: "The system's global status is normal. "
> Total disk count is: 83
> The Active Disk Count is: 82
> The Spare Disk Count is: 1
> The Failed Disk Count is: 0
>
>
> I've omitted the irrelevant output from my script, which does simple
> snmpgets using the Net-SNMP app. Anyway, three things went wrong here:
> the global message doesn't reflect a failure, the total disk count was
> decremented when it should not have been, and the failed disk count is
> still equal to zero.
> Now, because I know that all our filers should always have 2 spare disks,
> I can rewrite my scripts to look for this. However, I had previously written
> them in a very portable way, which hardcoding the number of spares to look
> for will degrade...
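
One possible way to keep the script portable without hardcoding the spare
count is to compare the total against a baseline recorded per filer on a
previous healthy run. This is just a sketch of that idea; the function name
and the `baseline_total` parameter are made up, and where the baseline gets
stored is left open.

```python
def check_against_baseline(total, active, spare, baseline_total):
    """Return a list of problems found (empty list if none)."""
    problems = []
    # Original check: every disk should be either active or spare.
    if total != active + spare:
        problems.append("total disks != active + spare")
    # Extra check: the total should never drop below the recorded baseline.
    if total < baseline_total:
        problems.append(f"total disks dropped from {baseline_total} to {total}")
    return problems

# The counts from the script output above: 84 disks expected, 83 seen.
print(check_against_baseline(total=83, active=82, spare=1, baseline_total=84))
# prints ['total disks dropped from 84 to 83']
```

With the counts reported above, the active-plus-spare check passes
(82 + 1 = 83), so only the baseline comparison flags the missing disk.
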
>
> Any ideas why this happens? I'm sure I'm not the only one who's seen
> this.
>
> Ben Rockwood
> brockwood(a)homestead-inc.com