One of our 6210 Filers running 8.1.3P1 stalled I/O for approximately one minute (11:08:27 - 11:09:24).
We saw this as latency of 5 seconds on the VMware hosts attached to the Filer via NFS and eventually "All Paths Down" messages on the ESX hosts.
I also saw warnings in the messages file:
Tue Oct 22 11:09:24 BST [TOASTER1: NwkThd_01:warning]: NFS response to client x.x.x.x for volume 0x5834a5c(vol004) was slow, op was v3 write, 69 > 60 (in seconds)
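Warnings like the one above have a fixed shape, so they can be pulled out of a saved copy of the messages file programmatically. A minimal Python sketch, assuming the field order in the sample line (the pattern and helper name are my own, not a NetApp tool):

```python
import re

# Pattern for the 7-Mode "NFS response ... was slow" warning, based on the
# sample line above; the exact field order is an assumption and may vary
# between ONTAP releases.
SLOW_NFS = re.compile(
    r"NFS response to client (?P<client>\S+) for volume "
    r"(?P<volid>\S+)\((?P<volume>[^)]+)\) was slow, op was "
    r"(?P<op>.+?), (?P<secs>\d+) > (?P<limit>\d+)"
)

def parse_slow_nfs(line):
    """Return a dict of fields from a slow-NFS warning line, or None."""
    m = SLOW_NFS.search(line)
    if not m:
        return None
    d = m.groupdict()
    d["secs"] = int(d["secs"])
    d["limit"] = int(d["limit"])
    return d
```

Running this over a syslog archive would give a timeline of which volumes and ops were affected, which is useful evidence to attach to a support case.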
From looking at the DFM Filer Summary view, I see a "Z" shape in the graphs for most of the counters on the Filer, e.g. CPU, Network Throughput and All Protocol Ops. The counters dip low, then rapidly increase and tail off again. (see attached JPG) http://network-appliance-toasters.10978.n7.nabble.com/file/n25314/Filer-Z-shape.jpg
During this time all of the ESX hosts saw timeouts to the NFS datastores.
I checked the disk_busy on the only aggregate (90 x 15K SAS 450GB) on the Filer; it only shows the disks as 30-40% busy, and disk_busy actually drops during the time the Filer stalled. It seems odd, as the overall load on the Filer didn't increase to precipitate this.
http://network-appliance-toasters.10978.n7.nabble.com/file/n25314/Filer-Z-shape-disks.jpg
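If the counter data can be exported from DFM as a series of per-interval samples, the sudden dip that marks the stall can be flagged automatically rather than spotted by eye. A rough sketch; the input format, drop threshold, and window size are all assumptions:

```python
def find_dips(samples, drop_ratio=0.5, window=5):
    """Flag indices where a counter falls below drop_ratio times the
    average of the preceding `window` samples -- a crude way to spot
    the sudden dip seen during a stall. `samples` is a plain list of
    numbers (e.g. ops/sec exported from DFM)."""
    dips = []
    for i in range(window, len(samples)):
        baseline = sum(samples[i - window:i]) / window
        if baseline > 0 and samples[i] < drop_ratio * baseline:
            dips.append(i)
    return dips
```

Cross-referencing the flagged intervals against the EMS log timestamps would show whether every counter dip lines up with a CP warning.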
From previous experience of performance cases with Netapp, we're usually asked to gather a perfstat the next time it happens, but this isn't possible as the stall is unpredictable and short-lived.
The thing that concerns me is that, from previous experience, this is usually a precursor to the Filer stalling for much longer periods, causing much more impact. In the past we've found the only option is to upgrade the Filer head.
I would appreciate any pointers on how to identify the root cause of this.
-- View this message in context: http://network-appliance-toasters.10978.n7.nabble.com/Netapp-FAS610-8-1-3P1-... Sent from the Network Appliance - Toasters mailing list archive at Nabble.com.
This reminds me of something that happened to a customer of mine a few weeks ago. You've hit a bug, although your version appears to be one that includes the fix:
https://forums.netapp.com/thread/19352
It sounds like you experienced "the dead cat bounce". I believe VMware suggests lowering the queue depth on your ESXi hosts to 64.
This does look very similar to bug 393877, but that was meant to be fixed in 8.1.3 (we are running 8.1.3P1). It is also mentioned in this VMware KB and thread:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd...
http://network-appliance-toasters.10978.n7.nabble.com/wafl-cp-slovol-warning...
These are the two bugs I was aware of:
393877 - inefficient pre-fetching of metadata blocks delays WAFL Consistency Point
599967 - Many concurrent deletions of large files can result in delays in CIFS or NFS operations (Not aware of any concurrent deletions)
I am thinking this is possibly another bug, or the same one that isn't fixed for this situation, as it doesn't look like a resource constraint.
I am going to try opening a case with Netapp, but as I mentioned before I don't hold out much hope for that.
Thanks Martin
Yeah, I realized that after I replied; I was thinking you were on 8.1.2P1.
I had upgraded from 8.1.1 to 8.1.2P4 to resolve said issues.
I had to wait and press IBM to actually put out P4, as P3 was the latest available from IBM at the time (N series).
I also enabled VMware SIOC (Storage I/O Control) limits, which seemed to help a lot. Prior to the upgrade, nearly every vMotion not throttled with SIOC caused 393877 to be triggered.
I've opened a case with Netapp as an outage rather than a performance issue, as the Filer stopped serving data on all protocols: iSCSI, NFS, CIFS.
It looks to me like this is either another bug or the same one that isn't fixed in this particular situation.
I'll post an update when I have more info.
You should be ready with a perfstat so that if it happens again, you can hopefully capture one while it is happening. Netapp will probably ask for that. Perhaps have it running constantly in a 15-minute loop or something.
--JMS
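A rolling capture like that could be scripted so only the most recent files are kept on disk. A hypothetical Python wrapper; the actual perfstat command name and flags are assumptions and would need to match your perfstat version:

```python
import os
import subprocess
import time

def rolling_capture(cmd, outdir, keep=4, interval_s=900, iterations=None):
    """Run `cmd` repeatedly, writing each capture to a timestamped file
    in `outdir` and keeping only the newest `keep` files. `cmd` would
    be the perfstat invocation (the exact flags are an assumption,
    e.g. ["perfstat", "-f", "filer", "-t", "15"]); pass `iterations`
    to bound the loop for testing."""
    os.makedirs(outdir, exist_ok=True)
    n = 0
    while iterations is None or n < iterations:
        # Timestamp plus a zero-padded counter keeps names unique and sortable.
        name = "capture-%d-%06d.out" % (int(time.time()), n)
        with open(os.path.join(outdir, name), "w") as f:
            subprocess.run(cmd, stdout=f, check=False)
        # Prune the oldest files beyond the retention limit.
        files = sorted(os.listdir(outdir))
        for old in files[:-keep]:
            os.remove(os.path.join(outdir, old))
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval_s)
    return sorted(os.listdir(outdir))
```

With a 15-minute interval and keep=4, roughly the last hour of captures would always be on disk when a stall hits.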
Support highlighted we have:
wafl.cp.toolong.warning:warning
in the EMS log.
I've also just noticed a bug (682289) relating to slow consistency points (CPs) which isn't fixed until 8.2P3. I am not clear whether this was introduced in 8.2 or is also present in 8.1.3P1.
It does feel like this is a similar thread conflict bug though.
Netapp have advised that we have an uneven distribution of load on the aggregate, and I've been running forced physical reallocates on each of the volumes on the aggregate in an attempt to eliminate this.
They also advised we have misaligned VMs which are more of a challenge to sort out.
It looks to me as though some thread on the Filer caused the whole Filer to stop responding, but Netapp are unable to confirm this is a bug without a perfstat. I recently read that Netapp advise against running perfstats continually, as it can have a detrimental performance impact. Bit of a catch-22.
I've added some more disks and spread the load across the two controllers in an attempt to reduce the load. I'm not convinced this is purely a load issue though as the Filer has been busier without any issues.
We've reallocated the volumes on the aggregate and also moved misaligned VMs off to another controller.
Checking the EMS log I see the same CP too long message:
<LR d="xxNov2013 xx:27:54" n="Filer1" t="1384626474" id="1378829410/161861" p="4" s="Ok" o="wafl_CP_proc" vf="" type="0" seq="3573636" > <wafl_cp_toolong_warning_1 total_ms="143438" total_dbufs="143192" clean="506" v_ino="14" v_bm="27" a_ino="0" a_bm="15" flush="142664"/> </LR>
In this case it took 143 seconds to complete a CP. This occurred when the load began to ramp up rather than at its peak.
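The total_ms figure in the EMS record above can be converted to seconds programmatically. A minimal sketch using Python's standard XML parser; the tag and attribute names are taken from the sample record and may differ between ONTAP releases:

```python
import xml.etree.ElementTree as ET

def cp_duration_seconds(lr_record):
    """Extract the consistency-point duration in seconds from an EMS
    <LR> record containing a wafl_cp_toolong_warning element.
    Returns None if no such element is present."""
    root = ET.fromstring(lr_record)
    for child in root:
        # The element name carries a version suffix (e.g. ..._warning_1),
        # so match on the prefix rather than the exact tag.
        if child.tag.startswith("wafl_cp_toolong_warning"):
            return int(child.attrib["total_ms"]) / 1000.0
    return None
```

Feeding every <LR> record from the EMS log through this would give a history of CP durations to plot against the load graphs.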
This looks to me like another NFS bug as the disks were 20-30% busy at the time and the CPU wasn't excessively high.