I'm trying to use mrtg to show the CPU utilization on a Network Appliance. Here's the relevant portion of the netapp MIB (output of an SNMPwalk, with my notes on the right side):
1.3.6.1.4.1.789.1.2.1.1.0: TIMETICKS: 1016121072 (uptime in ticks {1/100sec})
1.3.6.1.4.1.789.1.2.1.2.0: TIMETICKS: 33513506 (cpu ticks used since uptime)
1.3.6.1.4.1.789.1.2.1.3.0: INTEGER: 3 (% cpu util since uptime)
1.3.6.1.4.1.789.1.2.1.4.0: TIMETICKS: 982607568 (cpu idle ticks since uptime)
1.3.6.1.4.1.789.1.2.1.5.0: INTEGER: 97 (% idle since last boot)
1.3.6.1.4.1.789.1.2.2.1.0: INTEGER: 311445413 (nfs ops since boot)
1.3.6.1.4.1.789.1.2.2.2.0: INTEGER: 4503517 (total kbytes recd on all nets)
1.3.6.1.4.1.789.1.2.2.3.0: INTEGER: 1224614 (total kbytes trans since boot)
I can't use the "% cpu util since uptime" figure because that's just an average cpu utilization since the last reboot, which isn't particularly informative when your server is usually up for months at a time.
So I figured I'd use the "cpu ticks used since uptime" and "cpu idle ticks since uptime" MIB entries.
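To make the arithmetic concrete, here is a minimal Python sketch of how an interval utilization could be derived from two polls of those counters. The function name `cpu_utilization` and the second-poll numbers are hypothetical, and counter wrap is not handled; only the first set of values comes from the walk above.

```python
def cpu_utilization(prev_busy, prev_idle, cur_busy, cur_idle):
    """Percent CPU busy between two polls of the busy/idle tick counters.

    Returns None when no ticks elapsed (e.g. the same sample polled twice).
    Counter wrap and reboots are not handled in this sketch.
    """
    busy = cur_busy - prev_busy
    idle = cur_idle - prev_idle
    total = busy + idle
    if total <= 0:
        return None
    return 100.0 * busy / total


# Values from the walk above, measured from boot (previous sample = 0).
since_boot = cpu_utilization(0, 0, 33513506, 982607568)
print(round(since_boot))  # 3, matching the "% cpu util since uptime" entry

# Hypothetical second poll 300 seconds (30000 ticks) later.
print(cpu_utilization(33513506, 982607568, 33525506, 982625568))  # 40.0
```

Since the ticks are 1/100 sec, the per-second rate of the busy counter is already a percentage, so graphing it as a rate should give the same figure.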
The problem is that those SNMP entries are of type "TIMETICKS", which the BER.pm module interprets as an "uptime statistic". When I point mrtg at the "cpu idle ticks" entry, here's what the mrtg logfile looks like afterwards:
893639038 9 days, 8:48:06 9 days, 8:48:06
893639038 0 0 0 0
893638500 0 0 0 0
893638200 0 0 0 0
Ouch! The second time I run mrtg it fails with a "corrupt logfile" error (for good reason, that's corrupt all right!).
Here's the output of mrtg with the DEBUG level set at 8:
SNMPGET OID: 1.3.6.1.2.1.1.3.0
SNMPGET OID: 1.3.6.1.2.1.1.5.0
1.3.6.1.4.1.789.1.2.1.4.0&1.3.6.1.4.1.789.1.2.1.4.0:public@fs11 --> in: 9 days, 8:48:06 out: 9 days, 8:48:06 name: fs11.nas.nasa.gov
getting SNMP variables for target: 1:public@fs12
snmpget: ifInOctets.1 ifOutOctets.1 fs12 public
SNMPGET OID: 1.3.6.1.2.1.2.2.1.10.1
SNMPGET OID: 1.3.6.1.2.1.2.2.1.16.1
Now, here's the part of the BER.pm source code which decodes the SNMP reply:
sub pretty_print ($) {
    my ($packet) = @_;
    my ($type,$rest);
    my $result = ord (substr ($packet, 0, 1));
    return pretty_intlike ($packet)
        if $result == int_tag;
    return pretty_unsignedlike ($packet)
        if $result == snmp_counter32_tag || $result == snmp_gauge32_tag;
    return pretty_string ($packet) if $result == octet_string_tag;
    return pretty_oid ($packet) if $result == object_id_tag;
    ###THIS LINE
    return pretty_uptime ($packet) if $result == uptime_tag;
    ###THIS LINE
    return pretty_ip_address ($packet) if $result == snmp_ip_address_tag;
    return "(null)" if $result == null_tag;
    return "#<unprintable BER type $result>";
}
The "pretty_uptime" subroutine is about what you'd expect, converting the ticks to "up for 4 days 3 hours..." format.
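For illustration, the conversion can be sketched in Python as below. This is not BER.pm's actual code: the output format is inferred from the "9 days, 8:48:06" strings in the logfile above, and the handling of zero days and of the singular "day" are assumptions.

```python
def pretty_uptime(ticks):
    """Format TIMETICKS (1/100 sec units) as 'D days, H:MM:SS'.

    Sketch only: the format is inferred from the mrtg log shown above;
    the zero-day and singular-day cases are assumptions.
    """
    seconds = ticks // 100
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    clock = f"{hours}:{minutes:02d}:{secs:02d}"
    if days == 0:
        return clock
    unit = "day" if days == 1 else "days"
    return f"{days} {unit}, {clock}"


print(pretty_uptime(80928600))  # 9 days, 8:48:06
```

Once the value has been turned into a string like this, the original tick count is no longer usable as a number.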
So, if the result is an "uptime_tag" (which probably means the same thing as TIMETICKS), the BER.pm module formats it as an uptime string before it even gives it to mrtg.
That's exactly what we want for the "this router has been up for ...." messages on our webpages, but it makes the Netapp CPU information unavailable (except for the uptime since the last boot).
I really want to avoid maintaining homegrown customizations to mrtg. I'd prefer a solution to be incorporated into the main mrtg source code. Here are some suggestions on a fix:
1) Have mrtg spawn a separate script which uses snmpget, and then reports this information as an integer (Peter Buschman recently sent such a script to the toaster's alias...I was wondering why he didn't just have mrtg get the data directly...now I know).
2) Have the NetApp MIB report CPU ticks as an INTEGER or COUNTER instead of TIMETICKS
3) Have BER.pm report uptimes as an integer, and have the mrtg perl script do the "pretty_uptime" conversion if appropriate (i.e. for "router uptime" information at the top of the webpage, but not for data put in graphs).
4) ??? Any suggestions anyone?
I'm probably going to go with #1 unless someone has a better idea, but I thought this would be of sufficient interest that I'd spam you guys with it.
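A minimal sketch of what the script in #1 has to do, in Python. It relies on mrtg's external-program convention of four output lines (first value, second value, uptime string, target name). The helper names `parse_value` and `mrtg_lines` are hypothetical, and the input format is assumed to match the walk output above; actually invoking snmpget is left out because its flags and output vary between versions.

```python
import re

# Matches lines like "1.3.6.1.4.1.789.1.2.1.2.0: TIMETICKS: 33513506 ..."
# (format assumed from the walk output earlier in this message).
WALK_LINE = re.compile(r"^\s*([\d.]+):\s*(\w+):\s*(\d+)")


def parse_value(line):
    """Return the raw integer from one 'OID: TYPE: value' line."""
    match = WALK_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized SNMP output: {line!r}")
    return int(match.group(3))


def mrtg_lines(in_value, out_value, uptime, name):
    """Build the four lines mrtg reads from an external program."""
    return "\n".join([str(in_value), str(out_value), uptime, name])


busy = parse_value("1.3.6.1.4.1.789.1.2.1.2.0: TIMETICKS: 33513506")
idle = parse_value("1.3.6.1.4.1.789.1.2.1.4.0: TIMETICKS: 982607568")
print(mrtg_lines(busy, idle, "9 days, 8:48:06", "fs11.nas.nasa.gov"))
```

The point is that the tick counts reach mrtg as plain integers, so it can treat them like any other counter.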
Darrell Root
rootd@nas.nasa.gov