I don't have any brilliant suggestions here ... but I am intensely interested to hear how one can narrow down the fault domain ... as we intermittently see a similar set of symptoms in our own gear.

Heads <===> Fabric <===> Backend

So the Backend isn't servicing requests as quickly as the clients would like ... but why?

(a) Backend:  "Not enough spindles" for the load ... the queues on the backend grew deep enough (or requests were being dropped and retransmitted) that IOs took so long to complete that timers on the VMware hosts (and their guests) noticed and complained.  Similarly, CPs were taking a long time to complete.
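
Back-of-the-envelope, the arithmetic for (a) is unforgiving.  A quick sketch in plain Python -- every number in it is a made-up placeholder, not a measurement from anyone's system -- just to show how fast a deep queue turns into a client-visible timeout:

# Back-of-the-envelope: how a backlog on the spindles becomes a client timeout.
# All numbers are illustrative placeholders, not measurements.

service_time_ms = 8.0     # assumed avg time for one random IO on a busy spindle
queue_depth     = 64      # assumed IOs already queued ahead of ours
guest_timeout_s = 30.0    # typical guest SCSI timeout; check your own hosts

# Draining the queue in FIFO order, a newly arriving IO waits for everything
# ahead of it plus its own service time.
expected_latency_ms = (queue_depth + 1) * service_time_ms
print("expected IO latency: %.1f ms" % expected_latency_ms)

# How deep would the queue have to get before the guest's timer fires?
depth_at_timeout = int(guest_timeout_s * 1000 / service_time_ms)
print("queue depth that would trip a %.0f s timeout: ~%d IOs"
      % (guest_timeout_s, depth_at_timeout))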

(b) Frontend:  The frontend fumbled locks on the backend, e.g., laying down a write lock on a block, 'forgetting' to release it, then trying to lock that same block again ... and having to go through a preempt routine before straightening things out ... IOs stalled for a while, and timers on the VMware hosts and guests noticed and complained ...
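
To make (b) concrete, here is a toy sketch in Python -- emphatically not ONTAP's lock manager, just the shape of the failure mode: a write lock gets taken and never released, and the next attempt on the same block stalls until a preempt-style timeout breaks it loose.  The timeout value is a placeholder:

# Toy illustration only -- NOT how the filer manages block locks.
import threading, time

block_lock = threading.Lock()
PREEMPT_AFTER_S = 2.0    # placeholder for whatever the real preempt timer is

def first_write():
    block_lock.acquire()
    # ... the write completes, but the release is 'forgotten' ...

def second_write():
    start = time.time()
    # Stalls here until the lock frees up or the preempt timer expires.
    got_it = block_lock.acquire(timeout=PREEMPT_AFTER_S)
    if not got_it:
        print("preempting a stale lock after a %.1f s stall" % (time.time() - start))
        block_lock.release()     # toy stand-in for the preempt routine
        block_lock.acquire()
    print("second write proceeds after %.1f s" % (time.time() - start))
    block_lock.release()

first_write()
second_write()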

(c) Fabric:  The switch fabric between front & back dropped frames (e.g., a physical-layer error) ... [in Fletcher's case, perhaps there is no Fabric, merely point-to-point links ... but conceptually this is a possible component]
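
If there is a fabric in the path, one crude way to check for (c) is to watch the switch ports' error counters over time.  A sketch using pysnmp -- the hostname, community string, ifIndex, poll interval and the choice of IF-MIB counter are all placeholders for whatever your switch actually exposes (FC switches usually carry richer counters in their vendor MIBs):

# Poll a switch port's error counter over SNMP and report any increase.
import time
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

SWITCH, COMMUNITY, IFINDEX = 'fabric-sw1.example.com', 'public', 17   # placeholders

def if_in_errors():
    err_ind, err_stat, _, var_binds = next(getCmd(
        SnmpEngine(), CommunityData(COMMUNITY),
        UdpTransportTarget((SWITCH, 161)), ContextData(),
        ObjectType(ObjectIdentity('IF-MIB', 'ifInErrors', IFINDEX))))
    if err_ind or err_stat:
        raise RuntimeError(err_ind or err_stat.prettyPrint())
    return int(var_binds[0][1])

last = if_in_errors()
while True:
    time.sleep(60)                      # poll once a minute
    now = if_in_errors()
    if now != last:
        print('ifInErrors on ifIndex %d jumped by %d' % (IFINDEX, now - last))
    last = now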

(d) I suspect many others ...

So, for example, wafl_cp_toolong merely tells us that the Backend isn't servicing write requests as quickly as the Frontend wants (high latency) ... but it doesn't tell us why.  How does one drill down into what is happening between Front & Back?
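
One small step is simply pulling the timestamps of those complaints out of the filer's messages file so they can be overlaid on the host-side latency graphs.  A trivial sketch -- the file path and the exact message text are assumptions, adjust for what your filer actually logs:

# Extract the timestamps of the cp-too-long complaints from a copy of the
# filer's messages file; path and message text are assumptions.
import re, sys

MESSAGES = sys.argv[1] if len(sys.argv) > 1 else 'messages'
PATTERN = re.compile(r'wafl[._]cp[._]toolong', re.IGNORECASE)

events = []
with open(MESSAGES) as log:
    for line in log:
        if PATTERN.search(line):
            # Typical syslog lines start with "Mon DD HH:MM:SS"; keep that prefix.
            events.append(line[:15].strip())

print('%d cp-too-long events' % len(events))
for ts in events:
    print(ts)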

==> Is there a tool which records lock activity?
==> How does one insert a 'sniffer' into the path between Front & Back to capture FC (or SAS) traffic?
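
On the second question, getting the capture at all seems to be the hard part (an FC analyzer/TAP on the loop, or a SPAN-style mirror port if there is a fabric).  Once a capture exists, summarizing it is easy; a sketch that shells out to tshark and bins Fibre Channel frames per second -- the capture filename is a placeholder, and how cleanly 'fc' decodes depends on how the capture was taken:

# Summarize an existing capture: per-second counts of Fibre Channel frames.
import subprocess

CAPTURE = 'backend.pcap'   # placeholder

out = subprocess.run(
    ['tshark', '-r', CAPTURE, '-q', '-z', 'io,stat,1,fc'],
    capture_output=True, text=True, check=True)
print(out.stdout)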

--sk

On 1/15/2013 1:12 PM, Fletcher Cocquyt wrote:
resending this without the 80kb chart

Yesterday morning one of the heads on our 3270 experienced large NFS latency spikes causing our VMware hosts and their VMs to log storage timeouts.
This latency does not correlate with any external metrics like CPU, network, OPS, etc.