Originally attempted to post on June 22, 2010! Thanks for fixing the list.
We have since identified the issue by deconstructing the IOPS behind the latency spikes, and resolved it as described here:

http://www.vmadmin.info/2010/07/vmware-and-netapp-deconstructing.html

Hope it proves useful for anyone else with similar issues.

On 6/22/10 10:19 AM, "Fletcher Cocquyt" <fcocquyt@stanford.edu> wrote:

Hi,
We have a 3040 cluster hosting 11 vSphere hosts with 200 VMs on NFS datastores.
We see latency spikes 3-4 times a month as reported by Operations Manager.

We hoped our upgrade from 7.3.1.1 to 7.3.3 last week would help, but since going to 7.3.3 we’ve had many spikes of up to 1 second that take out an NFS mount and several of the VMs.

We previously identified the high- and medium-IO VMs and either aligned them or migrated them to local disk. This has NOT helped; we are still getting the spikes.

I have another case open with NetApp.

Following the notes in this latency spike related thread,
http://communities.netapp.com/message/30657
I ran wafl_susp -w to check pw.over_limit.

Turns out ours is ZERO (is it relevant to NFS?)

I suspect an internal NetApp process (dedup?) is responsible for these spikes. We had dedup disabled on 7.3.1.1. The 7.3.3 release was supposed to fix this (we re-enabled dedup after the upgrade).

And the latency spike outages are back.

I will share any info from the case.

thanks for any tips,

--
Fletcher Cocquyt
Principal Engineer
Information Resources and Technology (IRT)
Stanford University School of Medicine

http://vmadmin.info