One of our 6210 Filers running 8.1.3P1 stalled I/O for approximately one minute (11:08:27 - 11:09:24).
We saw this as latency of 5 seconds on the VMware hosts attached to the Filer via NFS and eventually "All Paths Down" messages on the ESX hosts.
I also saw warnings in the messages file:
Tue Oct 22 11:09:24 BST [TOASTER1: NwkThd_01:warning]: NFS response to client x.x.x.x for volume 0x5834a5c(vol004) was slow, op was v3 write, 69 > 60 (in seconds)
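Warnings like the one above have a fixed shape, so they can be pulled out of a saved copy of the messages file programmatically. A minimal Python sketch, assuming the field order in the sample line (the pattern and helper name are my own, not a NetApp tool):

```python
import re

# Pattern for the 7-Mode "NFS response ... was slow" warning, based on the
# sample line above; the exact field order is an assumption and may vary
# between ONTAP releases.
SLOW_NFS = re.compile(
    r"NFS response to client (?P<client>\S+) for volume "
    r"(?P<volid>\S+)\((?P<volume>[^)]+)\) was slow, op was "
    r"(?P<op>.+?), (?P<secs>\d+) > (?P<limit>\d+)"
)

def parse_slow_nfs(line):
    """Return a dict of fields from a slow-NFS warning line, or None."""
    m = SLOW_NFS.search(line)
    if not m:
        return None
    d = m.groupdict()
    d["secs"] = int(d["secs"])
    d["limit"] = int(d["limit"])
    return d
```

Running this over a syslog archive would give a timeline of which volumes and ops were affected, which is useful evidence to attach to a support case.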
From looking at the DFM Filer Summary view, I see a "Z" shape in the graphs for most of the counters on the Filer, e.g. CPU, Network Throughput and All Protocol Ops. The counters dip low, then rapidly increase and tail off again. (see attached JPG) http://network-appliance-toasters.10978.n7.nabble.com/file/n25314/Filer-Z-shape.jpg
During this time all of the ESX hosts saw timeouts to the NFS datastores.
I checked the disk_busy on the only aggregate (90 x 15K SAS 450GB) on the Filer; it only shows the disks as 30-40% busy, and disk_busy actually drops during the time the Filer stalled. It seems odd, as the overall load on the Filer didn't increase to precipitate this.
http://network-appliance-toasters.10978.n7.nabble.com/file/n25314/Filer-Z-shape-disks.jpg
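If the counter data can be exported from DFM as a series of per-interval samples, the sudden dip that marks the stall can be flagged automatically rather than spotted by eye. A rough sketch; the input format, drop threshold, and window size are all assumptions:

```python
def find_dips(samples, drop_ratio=0.5, window=5):
    """Flag indices where a counter falls below drop_ratio times the
    average of the preceding `window` samples -- a crude way to spot
    the sudden dip seen during a stall. `samples` is a plain list of
    numbers (e.g. ops/sec exported from DFM)."""
    dips = []
    for i in range(window, len(samples)):
        baseline = sum(samples[i - window:i]) / window
        if baseline > 0 and samples[i] < drop_ratio * baseline:
            dips.append(i)
    return dips
```

Cross-referencing the flagged intervals against the EMS log timestamps would show whether every counter dip lines up with a CP warning.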
From previous experience of performance cases with Netapp, we're usually asked to gather a perfstat the next time it happens, but this isn't possible as the stall is unpredictable and short-lived.
The thing that concerns me is that, from previous experience, this is usually a precursor to the Filer stalling for much longer periods, causing much more impact. In the past we've found the only option is to upgrade the Filer head.
I would appreciate any pointers on how to identify the root cause of this.
-- View this message in context: http://network-appliance-toasters.10978.n7.nabble.com/Netapp-FAS610-8-1-3P1-... Sent from the Network Appliance - Toasters mailing list archive at Nabble.com.
This reminds me of something that happened to a customer of mine a few weeks ago. You've hit a bug, although your version appears to be one that includes the fix:
https://forums.netapp.com/thread/19352
It sounds like you experienced "the dead cat bounce". I believe VMware suggests lowering the queue depth on your ESXi hosts to 64.
This does look very similar to bug 393877, but that was meant to be fixed in 8.1.3 (we are running 8.1.3P1). It is also mentioned in this VMware KB and thread:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd...
http://network-appliance-toasters.10978.n7.nabble.com/wafl-cp-slovol-warning...
These are the two bugs I was aware of:
393877 - inefficient pre-fetching of metadata blocks delays WAFL Consistency Point
599967 - Many concurrent deletions of large files can result in delays in CIFS or NFS operations (Not aware of any concurrent deletions)
I am thinking this is possibly another bug, or the same one that isn't fixed for this situation, as it doesn't look like a resource constraint.
I am going to try opening a case with Netapp, but as I mentioned before I don't hold out much hope for that.
Thanks Martin
Yeah, I realized that after I replied; I was thinking you were on 8.1.2P1.
I had upgraded from 8.1.1 to 8.1.2P4 to resolve said issues.
I had to wait and press IBM to actually put out P4, as P3 was the latest available from IBM at the time (N series).
I also enabled VMware SIOC (Storage I/O Control) limits, which seemed to help a lot. Prior to the upgrade, nearly every vMotion not throttled with SIOC caused 393877 to be triggered.
I've opened a case with Netapp as an outage rather than a performance issue, as the Filer stopped serving data on all protocols: iSCSI, NFS, CIFS.
It looks to me like this is either another bug or the same one that isn't fixed in this particular situation.
I'll post an update when I have more info.
You should be ready with a perfstat so that if it happens again, you can hopefully capture one while it is happening. Netapp will probably ask for that. Perhaps have it running constantly in a 15-minute loop or something.
--JMS
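A rolling capture like that could be scripted so only the most recent files are kept on disk. A hypothetical Python wrapper; the actual perfstat command name and flags are assumptions and would need to match your perfstat version:

```python
import os
import subprocess
import time

def rolling_capture(cmd, outdir, keep=4, interval_s=900, iterations=None):
    """Run `cmd` repeatedly, writing each capture to a timestamped file
    in `outdir` and keeping only the newest `keep` files. `cmd` would
    be the perfstat invocation (the exact flags are an assumption,
    e.g. ["perfstat", "-f", "filer", "-t", "15"]); pass `iterations`
    to bound the loop for testing."""
    os.makedirs(outdir, exist_ok=True)
    n = 0
    while iterations is None or n < iterations:
        # Timestamp plus a zero-padded counter keeps names unique and sortable.
        name = "capture-%d-%06d.out" % (int(time.time()), n)
        with open(os.path.join(outdir, name), "w") as f:
            subprocess.run(cmd, stdout=f, check=False)
        # Prune the oldest files beyond the retention limit.
        files = sorted(os.listdir(outdir))
        for old in files[:-keep]:
            os.remove(os.path.join(outdir, old))
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval_s)
    return sorted(os.listdir(outdir))
```

With a 15-minute interval and keep=4, roughly the last hour of captures would always be on disk when a stall hits.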
Support highlighted we have:
wafl.cp.toolong.warning:warning
in the EMS log.
I've also just noticed a bug (682289) relating to slow consistency points (CPs) which isn't fixed until 8.2P3. I am not clear whether this was introduced in 8.2 or is also present in 8.1.3P1.
It does feel like this is a similar thread conflict bug though.
Netapp have advised that we have an uneven distribution of load on the aggregate, and I've been running forced physical reallocates on each of the volumes on the aggregate in an attempt to eliminate this.
They also advised we have misaligned VMs which are more of a challenge to sort out.
It looks to me as though some thread on the Filer caused the whole Filer to stop responding, but Netapp are unable to confirm this is a bug without a perfstat. I recently read that Netapp advise against running perfstats continually, as it can have a detrimental performance impact. Bit of a catch-22.
I've added some more disks and spread the load across the two controllers in an attempt to reduce the load. I'm not convinced this is purely a load issue though as the Filer has been busier without any issues.
We've reallocated the volumes on the aggregate and also moved misaligned VMs off to another controller.
Checking the EMS log I see the same CP too long message:
<LR d="xxNov2013 xx:27:54" n="Filer1" t="1384626474" id="1378829410/161861" p="4" s="Ok" o="wafl_CP_proc" vf="" type="0" seq="3573636" > <wafl_cp_toolong_warning_1 total_ms="143438" total_dbufs="143192" clean="506" v_ino="14" v_bm="27" a_ino="0" a_bm="15" flush="142664"/> </LR>
In this case it took 143 seconds to complete a CP. This occurred when the load began to ramp up rather than at its peak.
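The total_ms figure in the EMS record above can be converted to seconds programmatically. A minimal sketch using Python's standard XML parser; the tag and attribute names are taken from the sample record and may differ between ONTAP releases:

```python
import xml.etree.ElementTree as ET

def cp_duration_seconds(lr_record):
    """Extract the consistency-point duration in seconds from an EMS
    <LR> record containing a wafl_cp_toolong_warning element.
    Returns None if no such element is present."""
    root = ET.fromstring(lr_record)
    for child in root:
        # The element name carries a version suffix (e.g. ..._warning_1),
        # so match on the prefix rather than the exact tag.
        if child.tag.startswith("wafl_cp_toolong_warning"):
            return int(child.attrib["total_ms"]) / 1000.0
    return None
```

Feeding every <LR> record from the EMS log through this would give a history of CP durations to plot against the load graphs.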
This looks to me like another NFS bug as the disks were 20-30% busy at the time and the CPU wasn't excessively high.