One of our 6210 Filers running 8.1.3P1 stalled I/O for approximately 1 minute (11:08:27 - 11:09:24).
We saw this as latency of 5 seconds on the VMware hosts attached to the Filer via NFS, and eventually as "All Paths Down" messages on the ESX hosts.
I also saw warnings in the messages file:
Tue Oct 22 11:09:24 BST [TOASTER1: NwkThd_01:warning]: NFS response to client x.x.x.x for volume 0x5834a5c(vol004) was slow, op was v3 write, 69 > 60 (in seconds)
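For correlating these warnings with the graphs, a rough Python sketch for pulling the fields out of such lines follows. The pattern is based only on the single warning quoted above, so the format on other releases is an assumption, and the names parse_slow_nfs and approx_op_start are purely illustrative.

```python
import re

# Pattern derived from the one warning line quoted above (assumption:
# other Data ONTAP releases may word or order this message differently).
SLOW_NFS = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"\[(?P<source>[^\]]+)\]: NFS response to client (?P<client>\S+) "
    r"for volume \S+\((?P<volume>[^)]+)\) was slow, "
    r"op was (?P<op>.+?), (?P<elapsed>\d+) > (?P<threshold>\d+) \(in seconds\)"
)


def parse_slow_nfs(line):
    """Return a dict of fields for a slow-NFS warning, or None if no match."""
    m = SLOW_NFS.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["elapsed"] = int(rec["elapsed"])
    rec["threshold"] = int(rec["threshold"])
    return rec


def approx_op_start(rec):
    """Estimate when the slow op began: log time minus elapsed seconds.

    Only the HH:MM:SS part of the timestamp is used, wrapping at midnight.
    """
    hh, mm, ss = (int(x) for x in rec["timestamp"].split()[-1].split(":"))
    start = (hh * 3600 + mm * 60 + ss - rec["elapsed"]) % 86400
    h, rem = divmod(start, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


if __name__ == "__main__":
    sample = (
        "Tue Oct 22 11:09:24 BST [TOASTER1: NwkThd_01:warning]: NFS response "
        "to client x.x.x.x for volume 0x5834a5c(vol004) was slow, "
        "op was v3 write, 69 > 60 (in seconds)"
    )
    rec = parse_slow_nfs(sample)
    print(rec["volume"], rec["op"], rec["elapsed"], approx_op_start(rec))
```

For the line above, subtracting the 69 seconds from the log time puts the start of that write at roughly 11:08:15, which can be compared against the dip in the DFM counters.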
In the DFM Filer Summary view I see a "Z" shape in the graph for most of the counters on the Filer, e.g. CPU, Network Throughput, All Protocol Ops. The counters dip low, then rapidly increase and tail off again (see attached JPG).
During this time all of the ESX hosts saw timeouts to the NFS datastores.
I checked disk_busy for the only aggregate (90 x 15K SAS 450GB) on the Filer: the disks were only 30-40% busy, and disk_busy actually dropped while the Filer was stalled. This seems odd, as the overall load on the Filer didn't increase beforehand to precipitate the stall.
From previous experience of performance cases with NetApp, we're usually asked to gather a perfstat the next time it happens, but this isn't possible as the stall is unpredictable and short-lived.
What concerns me is that, from previous experience, this is usually a precursor to the Filer stalling for much longer periods, with much more impact. In the past we've found the only option is to upgrade the Filer head.
I would appreciate any pointers on how to identify the root cause of this.