Hi Steve,

1) Use NetApp support - ask specifically whether there are any relevant known bugs in your 8.0.1 release, and ask for a reply by a specific time
2) Use Performance Advisor to help deconstruct where the finite pool of IOPS is going: http://www.vmadmin.info/2010/07/vmware-and-netapp-deconstructing.html
2a) Do the spikes in the disk busy view correlate strongly with the spikes in the latency view? (A quick way to quantify this is sketched after this list.)
2b) Use PA to see your baseline "normal" IOPS per aggregate, and use the latency-per-IOPS chart to estimate each aggregate's IOPS capacity at 20 ms latency - VMs don't tolerate much more than 20 ms at peak (see the interpolation sketch below)
3) Check whether unaligned I/O is contributing to the latency spikes by tracking partial writes: http://www.vmadmin.info/2010/07/quantifying-vmdk-misalignment.html (a simple alignment check is also sketched below)
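
For 2a, here's a minimal sketch of quantifying that correlation, assuming
you export the disk busy and latency counters to CSV first - the file
names and column headers below are hypothetical placeholders, not PA's
actual export format:

    import csv

    def load_series(path, column):
        """Read one numeric column from a CSV counter export (header row assumed)."""
        with open(path) as f:
            return [float(row[column]) for row in csv.DictReader(f)]

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length series."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5

    # File names and columns are assumptions - adjust to whatever you export.
    busy = load_series("disk_busy.csv", "disk_busy_pct")
    latency = load_series("latency.csv", "avg_latency_ms")
    print("correlation: %.2f" % pearson(busy, latency))

A coefficient near 1.0 says the busy disks and the latency spikes move
together; near 0 suggests the latency is coming from somewhere else.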
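
For 2b, a minimal sketch of the capacity-at-20ms calculation, linearly
interpolating over (IOPS, latency) points read off the PA latency-per-IOPS
chart - the sample numbers here are made up:

    # (total IOPS, avg latency in ms) samples, sorted by IOPS - values are examples
    samples = [(2000, 4.0), (4000, 6.5), (6000, 10.0), (8000, 16.0), (10000, 28.0)]

    def iops_at_latency(points, ceiling_ms=20.0):
        """Interpolate the IOPS level where latency first crosses the ceiling."""
        for (io0, lat0), (io1, lat1) in zip(points, points[1:]):
            if lat0 <= ceiling_ms <= lat1:
                frac = (ceiling_ms - lat0) / (lat1 - lat0)
                return io0 + frac * (io1 - io0)
        return None  # ceiling never crossed within the sampled range

    print("approx. aggregate IOPS capacity @ 20 ms:", iops_at_latency(samples))

With the sample points above this lands around 8700 IOPS; compare that
against your baseline "normal" to see how much headroom the aggregate
really has.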
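
For 3, the rule behind the partial-write tracking is that a guest
partition is aligned when its starting byte offset is a multiple of
WAFL's 4 KB block size. A minimal sketch, using starting sectors you'd
collect from fdisk -lu (or equivalent) inside each guest - the VM names
and sectors here are examples:

    SECTOR = 512       # bytes per sector
    WAFL_BLOCK = 4096  # WAFL block size in bytes

    partitions = {     # hypothetical guest -> partition starting sector
        "linux-vm-01": 63,    # classic misaligned MBR default
        "linux-vm-02": 2048,  # aligned (1 MiB boundary)
    }

    for vm, start in partitions.items():
        offset = start * SECTOR
        state = "ALIGNED" if offset % WAFL_BLOCK == 0 else "MISALIGNED"
        print("%s: start sector %d (offset %d B) -> %s" % (vm, start, offset, state))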
On May 24, 2012, at 5:28 AM, Steve Losen wrote:
Hello Toasters,

We had an incident where a FAS6040 running 8.0.1 gave very
slow responses to a variety of requests.  This resulted in
a bunch of Linux VMs experiencing failed disk I/O requests,
which led to corrupted local Linux filesystems.  One
Linux VM logged that it waited 180 seconds for a disk I/O to
complete.

The filer did not reboot and recovered on its own, but we
are very interested in figuring out what happened and whether
we can avoid it (known bug? should we upgrade ONTAP?).

This 6040 does a variety of jobs -- NFS server for a CommuniGate Pro
email system, NFS volumes for VMware, and it even has a few FC
SAN LUNs used by SharePoint.  In general it does not appear to be
overloaded.  It's been up for over 300 days with no problems.
The CF partner filer experienced no problems at that time, and it
also holds VMware volumes and mail volumes.

At the time of the outage the /etc/messages file indicated a slow
NFS response (93 sec) to one of the CommuniGate servers.  It also
indicated that an FC LUN was reset by the SharePoint server, I'm
guessing due to a delayed response.  The mail and SharePoint
volumes are in the same aggregate.  I see three resets for the
LUN in /etc/messages.

I looked in /etc/log/ems at about the time of the outage
(Tue May 22 17:30 EDT).  I see that a RAID scrub of a RAID
group in the mail/SharePoint aggregate completed with no errors.
I also see wafl_cp_toolong_warning_1 and wafl_cp_slovol_warning_1
for a different aggregate (which contains VMs).  I see several
of these, which I presume are caused by slow completion of
consistency points (CPs).

I don't know if these caused the problem or were caused by the
problem.

Anyone have any suggestions for further investigation or diagnosis?
Any other logs to look at?  Everything is fine now and has been
running normally since the outage.

Thanks,

Steve Losen   scl@virginia.edu    phone: 434-924-0640

University of Virginia               ITC Unix Support

