Greetings,
We upgraded a pair of 6080s and a pair of 6040s from 7.3.3P5 to 7.3.5.1P4 at the first of the year. Within 1 hour of the upgrade, our OpenNMS server started getting SNMP polling timeouts. Over the next 2 days we had ~130 timeouts per head on the 6080s, versus 0 in the prior 2 months that I checked. We also noticed that the average CPU load on each head went from ~40% to close to 60% utilization, comparing December to January. Any time the cluster is in failover mode, we get quite a few customer complaints about slowness. While I can't say for certain that we did not have this slowness before, none of us remembers any complaints when the cluster was failed over before. Obviously, trying to run the combined load of both heads on 1 filer at 120% utilization instead of 80% will cause this issue :-)
The 6040 cluster did not give the polling errors, but it looks like the CPU load is higher on those heads as well. Their load is usually low enough to gracefully cover a cluster failover.
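For anyone who wants the failover arithmetic spelled out, here is a rough sketch in Python. It assumes CPU load simply adds when one head takes over for its partner, which is a simplification, and the function name `failover_load` is just something I made up for illustration. The 40% and 60% figures are the approximate per-head averages quoted above.

```python
def failover_load(per_head_pct, heads=2):
    """Estimated CPU demand on the surviving head, in percent.

    Assumption: load is purely additive when one head serves
    both heads' workloads. Real behavior may differ.
    """
    return per_head_pct * heads


# Approximate per-head averages before and after the upgrade.
for label, pct in [("December (7.3.3P5)", 40), ("January (7.3.5.1P4)", 60)]:
    combined = failover_load(pct)
    status = "over capacity" if combined > 100 else "within capacity"
    print(f"{label}: {pct}% per head -> {combined}% on one head ({status})")
```

Under that assumption, anything above 50% per head leaves no room for a clean failover.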
I've had a case open since the first of the year, and it looks like we are now out of options and ideas, with no explanation or resolution. At this point our plan is to roll back to the 7.3.3P5 OS since it seemed to behave better. Since no problem or fix has been identified, we are unwilling to jump to a newer OS; we have no reason to believe the issue isn't inherited by later revisions. Rolling back to a known good OS seems to make the most sense. We will roll the 6080 cluster back first and see how it works out before we roll back the 6040 cluster.
My question to the group: has anyone else had a similar issue and then moved to a newer ONTAP release that fixed it? The vast majority of our data access is via NFS from RHEL5 clients and servers.
Thanks,
Jeff