It sounds like you experienced "The Dead Cat Bounce". I believe VMware suggests lowering the queue depth on your ESXi host to 64.
-----Original Message----- From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Milazzo Giacomo Sent: Tuesday, October 22, 2013 10:22 AM To: Martin; toasters@teaparty.net Subject: R: Netapp FAS610 8.1.3P1 stalling
This reminds me of something that happened a few weeks ago to a customer of mine. You've hit a bug, although your version seems to be one that already includes the fix.
https://forums.netapp.com/thread/19352
-----Original Message----- From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Martin Sent: Tuesday, October 22, 2013 16:06 To: toasters@teaparty.net Subject: Netapp FAS610 8.1.3P1 stalling
One of our 6210 Filers running 8.1.3P1 stalled I/O for approx 1 minute (11:08:27 - 11:09:24).
We saw this as latency of 5 seconds on the VMware hosts attached to the Filer via NFS and eventually "All Paths Down" messages on the ESX hosts.
I also saw warnings in the messages file:
Tue Oct 22 11:09:24 BST [TOASTER1: NwkThd_01:warning]: NFS response to client x.x.x.x for volume 0x5834a5c(vol004) was slow, op was v3 write, 69 > 60 (in seconds)
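Since a perfstat is hard to catch for a stall this short, one option is to mine the messages file for these warnings after the fact. The following is only a rough Python sketch: the regular expression is modelled on the single warning line above, and `parse_slow_nfs`, its field names and the `year` parameter are made up for illustration (the log line carries no year). It also assumes the warning is logged when the slow response finally completes, so subtracting the reported duration gives an estimate of when the operation started.

```python
import re
from datetime import datetime, timedelta

# Pattern modelled on the one warning line quoted above; other Data ONTAP
# releases may format this message differently (assumption).
SLOW_NFS = re.compile(
    r"\w{3} (?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ "
    r"\[(?P<filer>[^:\]]+): [^\]]*\]: "
    r"NFS response to client (?P<client>\S+) "
    r"for volume \S+\((?P<volume>[^)]+)\) was slow, "
    r"op was (?P<op>.+?), (?P<took>\d+) > (?P<limit>\d+) \(in seconds\)"
)


def parse_slow_nfs(line, year=2013):
    """Return details of a slow-NFS warning, or None if the line is not one.

    `year` must be supplied by the caller because the log line omits it.
    """
    m = SLOW_NFS.search(line)
    if m is None:
        return None
    logged_at = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    took = int(m["took"])
    return {
        "filer": m["filer"],
        "client": m["client"],
        "volume": m["volume"],
        "op": m["op"],
        "took_s": took,
        "limit_s": int(m["limit"]),
        "logged_at": logged_at,
        # Assumption: the warning is written when the response completes,
        # so the operation began roughly `took` seconds earlier.
        "started_at": logged_at - timedelta(seconds=took),
    }


if __name__ == "__main__":
    sample = (
        "Tue Oct 22 11:09:24 BST [TOASTER1: NwkThd_01:warning]: NFS response "
        "to client x.x.x.x for volume 0x5834a5c(vol004) was slow, "
        "op was v3 write, 69 > 60 (in seconds)"
    )
    hit = parse_slow_nfs(sample)
    print(hit["volume"], hit["op"], hit["started_at"].time(), hit["logged_at"].time())
```

Run over the whole messages file, the estimated start times can be lined up against the DFM graphs and the ESX host logs to see how far the stall window really extends.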
From looking in DFM Filer Summary view I see a "Z" shape in the graph for most of the counters on the Filer e.g. CPU, Network Throughput, All Protocol Ops. The counters dip low then rapidly increase and tail off again. (see attached JPG) http://network-appliance-toasters.10978.n7.nabble.com/file/n25314/Filer-Z-shape.jpg
During this time all of the ESX hosts saw timeouts to the NFS datastores.
I checked the disk_busy on the only aggregate (90 x 15K SAS 450GB) on the Filer and it only shows the disks as 30-40% busy and the disk_busy drops during the time the Filer stalled. It seems odd as the overall load on the Filer didn't increase to precipitate this.
http://network-appliance-toasters.10978.n7.nabble.com/file/n25314/Filer-Z-shape-disks.jpg
From previous experience of performance cases with Netapp, we're usually asked to gather a perfstat next time it happens but this isn't possible as it's unpredictable and short lived.
The thing that concerns me is from previous experience this is usually a precursor to the Filer stalling for much longer periods causing much more impact. In the past we've found the only option is to upgrade the Filer head.
I would appreciate any pointers on how to identify the root cause of this.
-- View this message in context: http://network-appliance-toasters.10978.n7.nabble.com/Netapp-FAS610-8-1-3P1-... Sent from the Network Appliance - Toasters mailing list archive at Nabble.com. _______________________________________________ Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters