False positives on filer/protocol being down are not uncommon in DFM.
If you start a continuous ping from your DFM server to the filer, do you see any lost packets, especially around the time the service-down errors start?
These errors occur when the DFM server cannot contact the filer via SNMP or ping, so there could be some network flakiness going on.
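A continuous ping is easier to line up with the DFM event times if every loss is timestamped. Below is a rough sketch of that idea, not anything DFM ships: the function names (ping_once, loss_windows, monitor) are made up for illustration, and it assumes a Linux-style ping that accepts "-c" and "-W". On a Windows DFM server the flags would need to change.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Send one ICMP echo with the system ping; True if a reply came back."""
    # Assumption: Linux (iputils) ping, where -c is the packet count and
    # -W is the reply timeout in seconds. Windows uses -n and -w (ms) instead.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def loss_windows(samples):
    """Group consecutive failed samples into (first, last, count) windows.

    samples is an iterable of (timestamp, ok) pairs in time order.
    """
    windows = []
    current = None
    for ts, ok in samples:
        if not ok:
            if current is None:
                current = [ts, ts, 1]   # start a new loss window
            else:
                current[1] = ts         # extend the open window
                current[2] += 1
        elif current is not None:
            windows.append(tuple(current))
            current = None
    if current is not None:             # loss still ongoing at the end
        windows.append(tuple(current))
    return windows


def monitor(host, interval_s=1.0, duration_s=3600):
    """Ping host every interval_s seconds for duration_s; return loss windows."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((datetime.now(), ping_once(host)))
        time.sleep(interval_s)
    return loss_windows(samples)


# Example (runs for an hour, so it is left commented out):
# for first, last, count in monitor("filer1.root.filer"):
#     print(f"{first:%Y-%m-%d %H:%M:%S} to {last:%H:%M:%S}: {count} lost")
```

If the loss windows it reports line up with the alert times (12:30, 13:00 in the examples below), that points at the network rather than the filers.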
A packet trace taken during one of the “filer down” events would also be useful. It will show whether the replies are making it back to the DFM server, though
that can be hard to catch in flight.
I’d start there. If you don’t see anything readily apparent, you should open a case with NetApp support and give them the data mentioned above, as well as a
dfmdc. That will at least get the ball rolling…
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net]
On Behalf Of Klise, Steve
Sent: Wednesday, March 20, 2013 4:38 PM
To: toasters@teaparty.net Lists
Subject: DFM/oncommand alerting
Has anyone experienced false positives for a filer and/or protocol being down? I have been getting these errors, and I have upgraded to 5.1. I was getting the
errors prior to the 5.1 upgrade as well.
This is an example of the alert.
A Warning event at 20 Mar 13:00 Pacific Daylight Time on Active/Active Controller filer1.root.filer…:
Host NFS service is down.
NFS service is down on host filer1.root.filer (10195).
This one gets my ticker going… :)
A Critical event at 20 Mar 12:30 Pacific Daylight Time on Active/Active Controller filer3.root.filer:
The Active/Active Controller is down.
This happens intermittently and it’s not filer specific. Nothing in the messages file; all good. No autosupport goes out, so I’m wondering if it’s
worth opening up a month-long case with NetApp.
All flavors of Data ONTAP are 8.1.x and DFM is monitoring about 5 HA pairs.
Thanks