Hi All,
I've been testing a few monitoring scripts for cDOT and had to pull some physical components out of heads and disk shelves. I noticed the following:
When a PSU is removed from a DS2246 disk shelf, this is not visible to CLI tools such as "alert" or "system health". Zephyr API calls in the 'ses' category don't seem to have visibility into it either.
So currently there is one disk shelf running on a single PSU, and cDOT is quite happy about it.
The disk shelf has an amber light on, and the following event was logged once the PSU was removed:
12/19/2014 11:05:15 na101node-1a WARNING ses.status.psWarning: DS2246 (S/N SHFHU1427000371) shelf 24 on channel 0b power warning for Power supply 1: not installed. This module is on the rear of the shelf at the bottom left.
It's running cDOT 8.2.2
Has anyone had a similar experience? Is this expected behavior?
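Since the event does at least land in EMS, a monitoring script can scrape for it. A minimal sketch in Python; the regex is inferred from the single sample line above, so the field layout on other releases is an assumption:

```python
import re

# Field layout inferred from the sample ses.status.psWarning line above;
# other Data ONTAP releases may word the event differently.
PS_WARNING = re.compile(
    r"ses\.status\.psWarning: (?P<shelf_type>\S+) "
    r"\(S/N (?P<serial>\S+)\) shelf (?P<shelf>\d+) on channel "
    r"(?P<channel>\S+) power warning for Power supply "
    r"(?P<psu>\d+): (?P<state>[^.]+)\."
)

def parse_ps_warning(line):
    """Return the PSU-warning fields as a dict, or None if no match."""
    m = PS_WARNING.search(line)
    return m.groupdict() if m else None

sample = ("12/19/2014 11:05:15 na101node-1a WARNING ses.status.psWarning: "
          "DS2246 (S/N SHFHU1427000371) shelf 24 on channel 0b power warning "
          "for Power supply 1: not installed. This module is on the rear of "
          "the shelf at the bottom left.")
print(parse_ps_warning(sample))
```

Anything that tails the EMS log (or polls it via the API) could feed lines through this and alert on the "not installed" state.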
Cheers, Vladimir
I opened a ticket with NetApp support with no luck so far.
The only way I found to identify such an issue is the following:
node_shell> sysconfig -M ..
!DS2246!SHFHU1427000371!0173!!
!DS2246-Pwr-Supply!<N/A>!<N/A>!<N/A>!<N/A>!!
!DS2246-Pwr-Supply!XXT140532704!114-00065+A2!9C!020F!!
The second record suggests there is no status information for the first PSU.
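Until support comes back with something better, that check can be automated by parsing the delimited records. A rough sketch, assuming the records keep the '!'-delimited layout shown above:

```python
def missing_psus(sysconfig_m_lines):
    """Scan 'sysconfig -M' records and return the 1-based indices of
    power-supply entries whose every field reads '<N/A>'."""
    missing = []
    psu_index = 0
    for line in sysconfig_m_lines:
        fields = line.strip("!").split("!")
        if fields[0].endswith("Pwr-Supply"):
            psu_index += 1
            # A slot with nothing but <N/A> fields reports no data at all,
            # which in practice means the PSU is absent
            if all(f == "<N/A>" for f in fields[1:] if f):
                missing.append(psu_index)
    return missing

output = [
    "!DS2246!SHFHU1427000371!0173!!",
    "!DS2246-Pwr-Supply!<N/A>!<N/A>!<N/A>!<N/A>!!",
    "!DS2246-Pwr-Supply!XXT140532704!114-00065+A2!9C!020F!!",
]
print(missing_psus(output))  # → [1]
```

Run against the output above it flags PSU 1, which matches the amber light.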
Vladimir
On Tue, Dec 23, 2014 at 3:51 PM, Momonth momonth@gmail.com wrote:
Hi All,
I've been testing a few monitoring scripts for cDOT and had to pull some physical components out of heads and disk shelves. I noticed the following:
When a PSU is removed from a DS2246 disk shelf, this is not visible to CLI tools such as "alert" or "system health". Zephyr API calls in the 'ses' category don't seem to have visibility into it either.
No experience with cDOT yet (installing this spring), but generally, for any equipment, I'd use SNMP for this. I'd go through the vendor's MIB file http://community.netapp.com/t5/Developer-Network-Articles-and-Resources/NetApp-SNMP-MIB-download-information/ta-p/85234 and ensure that the events I need to be notified about (like power issues) are above the notification threshold for my SNMP software. I'd also make sure that the box is able to call home and open tickets for critical issues; I've had the vendor help me with that in the past.
On Wed, Dec 24, 2014 at 8:12 AM, Momonth momonth@gmail.com wrote:
I opened a ticket with NetApp support with no luck so far.
The only way I found to identify such an issue is the following:
node_shell> sysconfig -M ..
!DS2246!SHFHU1427000371!0173!!
!DS2246-Pwr-Supply!<N/A>!<N/A>!<N/A>!<N/A>!!
!DS2246-Pwr-Supply!XXT140532704!114-00065+A2!9C!020F!!
The second record suggests there is no status information for the first PSU.
Vladimir
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
I'm currently using SNMP to monitor 7-Mode filers and I'm not entirely happy with it, e.g. the SNMP agent doesn't answer queries on a busy filer. I'd say the situation is even worse with SNMP support in cDOT; see http://www.netapp.com/in/media/tr-4220.pdf for details.
I agree, "autosupport" is a "must".
On Wed, Dec 24, 2014 at 3:26 PM, Basil basilberntsen@gmail.com wrote:
No experience with cDOT yet (installing this spring), but generally, for any equipment, I'd use SNMP for this. I'd go through the vendor's MIB file and ensure that the events I need to be notified about (like power issues) are above the notification threshold for my SNMP software. I'd also make sure that the box is able to call home and open tickets for critical issues; I've had the vendor help me with that in the past.
We've had some success using their API: I have Python scripts that pull network info and latency metrics into our Graphite system. We are probably going to switch our Nagios checks to this as well, as we have seen SNMP time out under similar conditions.
I think you can get what you need from the diagnosis API, more specifically diagnosis-alert-info, but I haven't tested it.
I do find that the API is much quicker than SNMP for bulk queries (all IO stats for every volume, for example), and much more reliable. It's an area that rewards some time spent, IMHO.
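For anyone who wants to poke at the API without the full SDK, the wire format is just XML POSTed over HTTP(S). A hedged sketch of building a request and reading the result status; the servlet path and namespace here are from memory, so verify them against the SDK documentation:

```python
import xml.etree.ElementTree as ET

# Servlet path used by ONTAPI/ZAPI; taken from memory, verify against the SDK
ZAPI_PATH = "/servlet/netapp.servlets.admin.XMLrequest_filer"

def build_zapi_request(api_name, **args):
    """Serialize a ZAPI call (e.g. diagnosis-alert-get-iter) into the XML
    body that would be POSTed to the filer at ZAPI_PATH."""
    root = ET.Element("netapp", version="1.0",
                      xmlns="http://www.netapp.com/filer/admin")
    call = ET.SubElement(root, api_name)
    for key, value in args.items():
        # ZAPI element names use dashes where Python arguments use underscores
        ET.SubElement(call, key.replace("_", "-")).text = str(value)
    return ET.tostring(root)

def zapi_status(response_xml):
    """Return the status attribute of the <results> element in a reply."""
    root = ET.fromstring(response_xml)
    results = root.find("results")
    if results is None:  # reply may carry the admin namespace
        results = root.find("{http://www.netapp.com/filer/admin}results")
    return results.get("status") if results is not None else None

canned = b'<netapp version="1.0"><results status="passed"/></netapp>'
print(build_zapi_request("diagnosis-alert-get-iter", max_records=20))
print(zapi_status(canned))  # → passed
```

The request body just needs to be POSTed with basic auth to the filer's management LIF; the SDK wraps exactly this exchange.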
On 06 Jan 13:53, Momonth wrote:
I'm currently using SNMP to monitor 7-Mode filers and I'm not entirely happy with it, e.g. the SNMP agent doesn't answer queries on a busy filer. I'd say the situation is even worse with SNMP support in cDOT; see http://www.netapp.com/in/media/tr-4220.pdf for details.
I agree, "autosupport" is a "must".
"John" == John Constable jc18@sanger.ac.uk writes:
John> We've had some success using their API - I have python scripts
John> that pull out network info and latency metrics into our graphite
John> system. We are probably going to switch to this for our Nagios
John> systems as we have seen the SNMP timeout under similar
John> conditions.
John> I think you can get what you need from the diagnosis API, more
John> specifically the diagnosis-alert-info, but I haven't tested it.
John> I do find that the API is much quicker than SNMP for bulk
John> queries (all io stats for every volume, for example), and much
John> more reliable. Its an area that rewards some time spent, IMHO..
Care to share your work with the rest of us, so we can all benefit? Or give pointers to the docs we need to read? I've got a cDOT setup that is crying out to be monitored, and since, from what you say, our new Nagios instance might not cut it, having other options would be ideal.
Thanks, John
On 06 Jan 16:17, John Stoffel wrote:
"John" == John Constable jc18@sanger.ac.uk writes:
John> We've had some success using their API - I have python scripts John> that pull out network info and latency metrics into our graphite John> system. We are probably going to switch to this for our Nagios John> systems as we have seen the SNMP timeout under similar John> conditions.
John> I think you can get what you need from the diagnosis API, more John> specifically the diagnosis-alert-info, but I haven't tested it.
John> I do find that the API is much quicker than SNMP for bulk John> queries (all io stats for every volume, for example), and much John> more reliable. Its an area that rewards some time spent, IMHO..
Care to share your work with the rest of us, so we can all benefit? Or give pointers to the docs we need to read? I've got a cDOT setup which is crying out to be monitored and since our new Nagios instance might not cut it from what you say, having other options would be ideal.
Sorry for the delay - I'm trying to work out whether the combination of SDK, API and open-source licences for the code means I can post it (sigh). Plus, it needs a little documenting, but I do plan to send out details if I can.
On 06 Jan 16:17, John Stoffel wrote:
"John" == John Constable jc18@sanger.ac.uk writes:
John> We've had some success using their API - I have python scripts John> that pull out network info and latency metrics into our graphite John> system. We are probably going to switch to this for our Nagios John> systems as we have seen the SNMP timeout under similar John> conditions.
John> I think you can get what you need from the diagnosis API, more John> specifically the diagnosis-alert-info, but I haven't tested it.
John> I do find that the API is much quicker than SNMP for bulk John> queries (all io stats for every volume, for example), and much John> more reliable. Its an area that rewards some time spent, IMHO..
Care to share your work with the rest of us, so we can all benefit? Or give pointers to the docs we need to read? I've got a cDOT setup which is crying out to be monitored and since our new Nagios instance might not cut it from what you say, having other options would be ideal.
Probably waaaay too late now (and apologies for that), but I have put the code up on GitHub: https://github.com/kript/Monitoring-Stuff/blob/master/Generate_NetApp_Perf_s...
Which means I can use it like this:
python Generate_NetApp_Perf_stats.py get-counter-values volume nfs_write_ops nfs_read_ops nfs_write_latency nfs_read_latency nfs_other_ops nfs_other_latency
to push IO stats to Graphite for every volume on each of the NetApp 7-Mode units in a JSON-delimited file list.
or this:
python Generate_NetApp_Perf_stats.py get-counter-values ifnet recv_packets recv_errors send_packets send_errors collisions recv_data send_data recv_mcast
to generate network stats for the same.
It runs every minute and puts negligible load on the filer, and returns quickly (unlike SNMP).
It's not really written for portability, but it shouldn't be hard to convert. Commenting out line 327 and uncommenting 326 will show you it in action, and changing the CARBON_SERVER variable to your Graphite system should be enough, once you have created your list of 7-Mode systems (and set up their API user).
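For reference, the Graphite side needs nothing exotic: Carbon's plaintext listener (TCP 2003 by default) takes one "path value timestamp" record per line. A minimal sketch of that formatting, independent of the script above; the host name is a placeholder:

```python
import socket
import time

CARBON_SERVER = "graphite.example.com"  # placeholder for your Graphite host
CARBON_PORT = 2003  # Carbon's plaintext protocol listener

def carbon_lines(filer, counters, now=None):
    """Format counter values as Carbon plaintext records,
    one 'metric.path value timestamp' per line."""
    ts = int(now if now is not None else time.time())
    return "".join(
        "netapp.%s.%s %s %d\n" % (filer, name, value, ts)
        for name, value in counters.items()
    )

def send_to_carbon(payload):
    """Ship a batch of plaintext records to Carbon over TCP."""
    with socket.create_connection((CARBON_SERVER, CARBON_PORT),
                                  timeout=5) as sock:
        sock.sendall(payload.encode())

print(carbon_lines("na101", {"nfs_read_ops": 1200}, now=1420000000), end="")
# → netapp.na101.nfs_read_ops 1200 1420000000
```

Batching all counters from one poll into a single TCP write keeps the per-minute load on both ends negligible.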
Feel free to fork/improve/ignore.. :)