Hello. I am hoping you can guide me in the right direction
We have been experiencing very high CPU load on a 7-mode HA pair of 3270 controllers running 8.1.3P2.
We have worked with NetApp support on these issues, and they note that our workload is very high on one controller (the one we run our VMware setup from).
We also have a so-called "bad practice": our Exchange iSCSI LUNs are on SATA, with logs and DBs on the same aggregate (I am separating these out as I type).
Support has told me we have VMDK misalignment. However, I spent a long time a few months ago resolving this, first using the VSC tool to confirm the problem and then fixing it with a combination of MBRALIGN and VMware Converter as a V2V process.
The support engineer tells me he sees misalignment when he runs nfsstat -d, but MBRSCAN shows these VMDKs are aligned. What is going on here?
Trying to reduce our CPU and IO burden but getting conflicting information.
Finally, should we look to upgrade to 8.1.4P2 to pick up some bug fixes? We would consider 8.2.x, but I don't think we can, as we run Exchange 2010 (using SME 6.x etc.).
Thanks in advance
William
First off, make sure the values in nfsstat -d are actually incrementing significantly: run nfsstat -z to clear the counters, wait a while, and then look at nfsstat -d again.
You may find that you are only doing a handful of unaligned ops and not hundreds or thousands per second.
--Jordan
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Will.Burchell@skanska.co.uk Sent: Monday, June 16, 2014 2:50 PM To: Toasters@teaparty.net Subject: High CPU VM misalignment confusion
Thanks
I reset the counters with the -z switch.
I then ran -d again 5 minutes later. Many of the counters are in the tens, so I am happy with those. However, one server is in the thousands already. This is a Windows 2000 server (don't ask, please!) which has a misaligned C drive, but I have used the "functional aligned" datastore in VSC to get around this. I assume nfsstat -d doesn't understand that, hence the counters in the thousands.
William
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 19:57 To: Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Yes, good observation.
nfsstat is just a dummy light.
Accept no action items purely on a view of a light on the dashboard.
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
Even if it is 10k ops after 5 minutes...that is only 33 ops per second. I doubt 33 unaligned ops per second is your cpu issue.
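The arithmetic behind that estimate, as a minimal sketch. The function name `ops_per_second` is made up for illustration, and it assumes the counters were zeroed with nfsstat -z at the start of the interval.

```python
def ops_per_second(op_count, elapsed_minutes):
    """Average rate for a counter that started at zero (e.g. after nfsstat -z)."""
    if elapsed_minutes <= 0:
        raise ValueError("elapsed_minutes must be positive")
    return op_count / (elapsed_minutes * 60)


# Numbers from this thread: roughly 10k unaligned ops counted over 5 minutes.
rate = ops_per_second(10_000, 5)
print(f"{rate:.1f} unaligned ops/s")
```

The same function can be applied to each client's counter to see which ones are worth chasing.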
Maybe you can fix that one top talker just to show support that it is not the issue? Depending on how critical that one system is, that may or may not be worth fighting with support over.
Now, on to the CPU issue. You are using "sysstat -m 1" to look at all CPUs and not only the "ANY" CPU metric, right?
If you run, for example, "sysstat -x 1", you are looking at the percentage of time that ANY of your CPUs is busy. That metric seems nearly useless to me.
--Jordan
From: Will.Burchell@skanska.co.uk [mailto:Will.Burchell@skanska.co.uk] Sent: Monday, June 16, 2014 3:07 PM To: Jordan Slingerland; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
I am checking all CPUs and they are pretty busy.
We are in the UK so it's out of hours (and our nightly processes are mostly stopped right now).
We have the issue I mentioned where our Exchange LUNs are on the same aggregate together, and we have a high IO workload with 6000 mailboxes.
This is the sysstat -M 1 output right now as an example. It seems high considering there is no de-dupe and only a single SnapMirror running (to do a vol move for our Exchange separation problem).
Any other thoughts? I am most interested.
William
ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
92% 76% 56% 32% 68% 67% 73% 75% 59% 47% 0% 0% 22% 26% 0% 12% 105%( 65%) 0% 5% 0% 34% 9% 13% 9152 0%
98% 90% 80% 60% 85% 87% 89% 88% 75% 31% 0% 0% 26% 42% 0% 15% 122%( 72%) 21% 11% 0% 52% 9% 9% 5637 50%
99% 92% 80% 55% 84% 86% 90% 91% 69% 31% 0% 0% 29% 58% 0% 16% 103%( 66%) 3% 11% 0% 67% 9% 8% 5962 100%
98% 91% 79% 54% 84% 84% 88% 90% 72% 45% 0% 0% 27% 45% 0% 9% 122%( 78%) 0% 12% 0% 54% 8% 10% 7222 100%
99% 94% 88% 67% 89% 91% 94% 94% 77% 25% 0% 0% 28% 63% 0% 21% 100%( 65%) 25% 9% 0% 71% 8% 7% 4452 100%
97% 91% 79% 52% 83% 84% 89% 91% 67% 39% 0% 0% 30% 51% 0% 8% 113%( 73%) 0% 10% 0% 62% 8% 9% 8253 100%
98% 87% 71% 44% 78% 79% 84% 83% 67% 46% 0% 0% 25% 33% 0% 14% 121%( 74%) 0% 9% 0% 42% 12% 11% 9237 66%
97% 93% 86% 65% 88% 87% 92% 93% 80% 29% 0% 0% 27% 50% 0% 22% 116%( 69%) 24% 9% 0% 59% 9% 8% 5213 63%
97% 85% 69% 42% 76% 77% 83% 85% 60% 37% 0% 0% 28% 42% 0% 9% 109%( 69%) 1% 10% 0% 48% 11% 10% 6795 100%
98% 91% 77% 50% 82% 83% 88% 91% 66% 39% 0% 0% 30% 50% 0% 8% 116%( 73%) 0% 10% 0% 58% 7% 10% 6993 100%
98% 92% 82% 62% 86% 85% 90% 91% 78% 29% 0% 0% 28% 51% 0% 20% 108%( 69%) 21% 14% 0% 59% 6% 8% 5308 90%
100% 97% 91% 65% 90% 92% 94% 96% 80% 30% 0% 0% 30% 59% 0% 16% 120%( 76%) 3% 20% 0% 68% 9% 7% 5593 100%
98% 85% 70% 47% 78% 76% 82% 81% 71% 33% 0% 0% 26% 41% 0% 16% 110%( 70%) 4% 12% 0% 48% 10% 10% 5907 79%
100% 98% 89% 61% 89% 91% 94% 96% 75% 28% 0% 0% 32% 62% 0% 20% 98%( 65%) 17% 10% 0% 73% 8% 7% 5290 100%
98% 91% 77% 50% 82% 80% 85% 89% 72% 33% 0% 0% 30% 48% 0% 21% 108%( 64%) 0% 12% 0% 59% 6% 9% 6047 100%
99% 91% 75% 49% 82% 80% 84% 85% 77% 36% 0% 0% 26% 29% 0% 15% 144%( 80%) 1% 12% 0% 44% 10% 10% 6412 67%
100% 95% 88% 68% 90% 88% 94% 97% 80% 26% 0% 0% 29% 59% 0% 26% 100%( 66%) 23% 16% 0% 65% 8% 7% 4602 100%
98% 87% 74% 48% 79% 78% 86% 90% 63% 30% 0% 0% 29% 52% 0% 9% 105%( 68%) 0% 14% 0% 60% 9% 8% 5533 100%
98% 88% 77% 58% 83% 81% 87% 90% 73% 30% 0% 0% 27% 47% 0% 19% 106%( 66%) 21% 10% 0% 54% 9% 8% 5691 98%
ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
97% 86% 70% 43% 77% 77% 84% 87% 61% 39% 0% 0% 27% 39% 0% 7% 116%( 73%) 0% 11% 0% 49% 11% 10% 7526 100%
97% 86% 70% 44% 78% 80% 85% 87% 61% 34% 0% 0% 28% 44% 0% 9% 108%( 68%) 0% 14% 0% 53% 11% 13% 6308 100%
98% 87% 77% 59% 82% 80% 86% 88% 76% 28% 0% 0% 24% 44% 0% 23% 106%( 66%) 21% 14% 0% 53% 9% 8% 5200 82%
100% 96% 86% 57% 87% 88% 92% 95% 73% 30% 0% 0% 30% 56% 0% 18% 111%( 69%) 3% 18% 0% 68% 6% 8% 5163 100%
98% 90% 78% 55% 83% 82% 88% 91% 69% 32% 0% 0% 28% 44% 0% 11% 119%( 74%) 6% 19% 0% 54% 9% 9% 6148 99%
100% 97% 89% 64% 89% 91% 93% 96% 75% 34% 0% 0% 30% 62% 0% 22% 99%( 65%) 17% 9% 0% 70% 6% 8% 6496 100%
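For anyone who wants averages rather than eyeballing one-second samples, here is a rough Python sketch that maps each sysstat -M data row onto the column names and averages a few columns. It assumes the 25-column layout printed in this message (other ONTAP releases may print different columns), and the helper names `parse_row` and `column_means` are invented for illustration.

```python
import re

# Column names copied from the header row in this message.
HEADER = ("ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol "
          "Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean "
          "SM_Exempt Cifs Exempt Intr Host Ops/s CP").split()


def parse_row(line):
    """Map one sysstat -M data row onto the header columns above."""
    # "105%( 65%)" is a single column; close the gap so split() keeps it whole.
    tokens = re.sub(r"\(\s*", "(", line).split()
    if len(tokens) != len(HEADER):
        raise ValueError(f"expected {len(HEADER)} columns, got {len(tokens)}")
    # Keep only the leading number of each token (drops "%" and "(65%)").
    return {name: float(re.match(r"\d+", tok).group())
            for name, tok in zip(HEADER, tokens)}


def column_means(lines, columns):
    """Average the chosen columns over all data rows, skipping header rows."""
    rows = [parse_row(l) for l in lines
            if l.strip() and not l.strip().startswith("ANY1+")]
    return {c: sum(r[c] for r in rows) / len(rows) for c in columns}


# First three samples from the output in this message.
sample = [
    "92% 76% 56% 32% 68% 67% 73% 75% 59% 47% 0% 0% 22% 26% 0% 12% 105%( 65%) 0% 5% 0% 34% 9% 13% 9152 0%",
    "98% 90% 80% 60% 85% 87% 89% 88% 75% 31% 0% 0% 26% 42% 0% 15% 122%( 72%) 21% 11% 0% 52% 9% 9% 5637 50%",
    "99% 92% 80% 55% 84% 86% 90% 91% 69% 31% 0% 0% 29% 58% 0% 16% 103%( 66%) 3% 11% 0% 67% 9% 8% 5962 100%",
]
means = column_means(sample, ["AVG", "Kahuna", "WAFL_Ex(Kahu)", "Exempt"])
for name, mean in means.items():
    print(f"{name}: {mean:.1f}%")
```

Feeding it a longer capture (for example a few minutes of samples saved to a file) gives a steadier picture than any single line.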
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 20:14 To: Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Hi,
Might be worth checking out the article at this link http://www.jk-47.com/2014/02/attack-of-old-bugs-netapp-high-cpu/
We ran into this exact issue after upgrading a system from 8.0.3P2 to 8.1.4P1: a process was looping in wafl scan blk_reclaim.
BR Marcus
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Will.Burchell@skanska.co.uk Sent: den 16 juni 2014 21:19 To: Jordan.Slingerland@independenthealth.com; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
It sounds like you ruled out the obvious, but I will ask anyway: no deduplications are running, right?
And the not so obvious: if none are running, look at sis status -l and check whether any of the volumes are over 20% in the Stale Fingerprints: column.
--Jordan
From: Marcus Nilsson [mailto:marcus.nilsson@atea.se] Sent: Monday, June 16, 2014 3:44 PM To: Will.Burchell@skanska.co.uk; Jordan Slingerland; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
No deduplications are running
So I have run sis status -l and can confirm the following SIS jobs and their stale fingerprints. This looks pretty bad. What are we doing wrong here?
We upgraded through various versions of ONTAP last year and believe we ran into SIS issues, but thought they had been cleared by running the sis start -S command to clean them out
We are on 8.1.3P2 and came from 8.0.x through many versions of 8.1.x over the last 18 months
Will
8% 1% 79% 14% 0% 52% 130% 108% 126% 117% 181% 112% 121% 81% 7% 0% 0% 0% 0% 25% 0% 26%
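As a rough way to read the list above against the 20% guideline from the quoted message below, here is a small Python sketch. The percentages are copied from this message in the order given; the volume names did not survive in the archive, so entries are identified by position only, and the variable names are my own.

```python
# Stale fingerprint percentages as posted, in the order given.
# Volume names are not available, so positions stand in for them.
STALE_PERCENT = [8, 1, 79, 14, 0, 52, 130, 108, 126, 117, 181,
                 112, 121, 81, 7, 0, 0, 0, 0, 25, 0, 26]

THRESHOLD_PERCENT = 20  # guideline mentioned in the thread, not an official limit

over_threshold = [(position, pct)
                  for position, pct in enumerate(STALE_PERCENT, start=1)
                  if pct > THRESHOLD_PERCENT]
over_hundred = [pct for pct in STALE_PERCENT if pct > 100]

print(f"{len(over_threshold)} of {len(STALE_PERCENT)} volumes exceed "
      f"{THRESHOLD_PERCENT}% stale fingerprints")
print(f"{len(over_hundred)} volumes are above 100%")
for position, pct in over_threshold:
    print(f"  volume #{position}: {pct}%")
```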
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 20:47 To: Marcus Nilsson; Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
It sounds like you ruled out the obvious, but I will say it anyway: no deduplications are running, right?
And the not so obvious: if none are running, look at sis status -l and check whether any of the volumes are over 20% in the Stale Fingerprints: column.
--Jordan
From: Marcus Nilsson [mailto:marcus.nilsson@atea.se] Sent: Monday, June 16, 2014 3:44 PM To: Will.Burchell@skanska.co.uk; Jordan Slingerland; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Hi, Might be worth checking out the article at this link http://www.jk-47.com/2014/02/attack-of-old-bugs-netapp-high-cpu/
We ran into this exact issue after upgrading a system from 8.0.3P2 to 8.1.4P1. A process looping in wafl scan blk_reclaim.
BR Marcus
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Will.Burchell@skanska.co.uk Sent: 16 June 2014 21:19 To: Jordan.Slingerland@independenthealth.com; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
I am checking all cpus and they are pretty busy
We are in the UK so it's out of hours (and our nightly processes are mostly stopped right now)
We have the issue I mentioned where our exchange LUNs are on the same aggregate together and we have a high IO workload with 6000 mailboxes.
This is the sysstat -M 1 right now as an example. It seems high considering there is no de-dupe and only a single snapmirror running (to do a vol move for our exchange separation problem)
I am most interested in any other thoughts
William
ANY1+ ANY2+ ANY3+ ANY4+  AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s   CP
  92%   76%   56%   32%  68%  67%  73%  75%  59%     47%       0%      0%     22%  26%     0%    12%    105%( 65%)          0%        5%   0%    34%   9%  13%  9152   0%
  98%   90%   80%   60%  85%  87%  89%  88%  75%     31%       0%      0%     26%  42%     0%    15%    122%( 72%)         21%       11%   0%    52%   9%   9%  5637  50%
  99%   92%   80%   55%  84%  86%  90%  91%  69%     31%       0%      0%     29%  58%     0%    16%    103%( 66%)          3%       11%   0%    67%   9%   8%  5962 100%
  98%   91%   79%   54%  84%  84%  88%  90%  72%     45%       0%      0%     27%  45%     0%     9%    122%( 78%)          0%       12%   0%    54%   8%  10%  7222 100%
  99%   94%   88%   67%  89%  91%  94%  94%  77%     25%       0%      0%     28%  63%     0%    21%    100%( 65%)         25%        9%   0%    71%   8%   7%  4452 100%
  97%   91%   79%   52%  83%  84%  89%  91%  67%     39%       0%      0%     30%  51%     0%     8%    113%( 73%)          0%       10%   0%    62%   8%   9%  8253 100%
  98%   87%   71%   44%  78%  79%  84%  83%  67%     46%       0%      0%     25%  33%     0%    14%    121%( 74%)          0%        9%   0%    42%  12%  11%  9237  66%
  97%   93%   86%   65%  88%  87%  92%  93%  80%     29%       0%      0%     27%  50%     0%    22%    116%( 69%)         24%        9%   0%    59%   9%   8%  5213  63%
  97%   85%   69%   42%  76%  77%  83%  85%  60%     37%       0%      0%     28%  42%     0%     9%    109%( 69%)          1%       10%   0%    48%  11%  10%  6795 100%
  98%   91%   77%   50%  82%  83%  88%  91%  66%     39%       0%      0%     30%  50%     0%     8%    116%( 73%)          0%       10%   0%    58%   7%  10%  6993 100%
  98%   92%   82%   62%  86%  85%  90%  91%  78%     29%       0%      0%     28%  51%     0%    20%    108%( 69%)         21%       14%   0%    59%   6%   8%  5308  90%
 100%   97%   91%   65%  90%  92%  94%  96%  80%     30%       0%      0%     30%  59%     0%    16%    120%( 76%)          3%       20%   0%    68%   9%   7%  5593 100%
  98%   85%   70%   47%  78%  76%  82%  81%  71%     33%       0%      0%     26%  41%     0%    16%    110%( 70%)          4%       12%   0%    48%  10%  10%  5907  79%
 100%   98%   89%   61%  89%  91%  94%  96%  75%     28%       0%      0%     32%  62%     0%    20%     98%( 65%)         17%       10%   0%    73%   8%   7%  5290 100%
  98%   91%   77%   50%  82%  80%  85%  89%  72%     33%       0%      0%     30%  48%     0%    21%    108%( 64%)          0%       12%   0%    59%   6%   9%  6047 100%
  99%   91%   75%   49%  82%  80%  84%  85%  77%     36%       0%      0%     26%  29%     0%    15%    144%( 80%)          1%       12%   0%    44%  10%  10%  6412  67%
 100%   95%   88%   68%  90%  88%  94%  97%  80%     26%       0%      0%     29%  59%     0%    26%    100%( 66%)         23%       16%   0%    65%   8%   7%  4602 100%
  98%   87%   74%   48%  79%  78%  86%  90%  63%     30%       0%      0%     29%  52%     0%     9%    105%( 68%)          0%       14%   0%    60%   9%   8%  5533 100%
  98%   88%   77%   58%  83%  81%  87%  90%  73%     30%       0%      0%     27%  47%     0%    19%    106%( 66%)         21%       10%   0%    54%   9%   8%  5691  98%
  97%   86%   70%   43%  77%  77%  84%  87%  61%     39%       0%      0%     27%  39%     0%     7%    116%( 73%)          0%       11%   0%    49%  11%  10%  7526 100%
  97%   86%   70%   44%  78%  80%  85%  87%  61%     34%       0%      0%     28%  44%     0%     9%    108%( 68%)          0%       14%   0%    53%  11%  13%  6308 100%
  98%   87%   77%   59%  82%  80%  86%  88%  76%     28%       0%      0%     24%  44%     0%    23%    106%( 66%)         21%       14%   0%    53%   9%   8%  5200  82%
 100%   96%   86%   57%  87%  88%  92%  95%  73%     30%       0%      0%     30%  56%     0%    18%    111%( 69%)          3%       18%   0%    68%   6%   8%  5163 100%
  98%   90%   78%   55%  83%  82%  88%  91%  69%     32%       0%      0%     28%  44%     0%    11%    119%( 74%)          6%       19%   0%    54%   9%   9%  6148  99%
 100%   97%   89%   64%  89%  91%  93%  96%  75%     34%       0%      0%     30%  62%     0%    22%     99%( 65%)         17%        9%   0%    70%   6%   8%  6496 100%
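For anyone who wants to summarise output like the above rather than eyeball it, here is a minimal Python sketch that averages the per-domain columns and reports the busiest one. The column names come from the header in this message, and the three sample rows are the first three rows posted. The parsing approach, the function names, and the choice to treat Network through Host as the "domain" columns are my own assumptions; the sketch takes the value outside the parentheses for WAFL_Ex(Kahu).

```python
import re

# Column names as they appear in the posted sysstat -M header.
COLUMNS = ("ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol "
           "Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean "
           "SM_Exempt Cifs Exempt Intr Host Ops/s CP").split()

# Assumption: the columns from Network through Host are the per-domain figures.
DOMAINS = COLUMNS[9:23]

# First three rows from the posted output.
SAMPLE_ROWS = [
    "92% 76% 56% 32% 68% 67% 73% 75% 59% 47% 0% 0% 22% 26% 0% 12% 105%( 65%) 0% 5% 0% 34% 9% 13% 9152 0%",
    "98% 90% 80% 60% 85% 87% 89% 88% 75% 31% 0% 0% 26% 42% 0% 15% 122%( 72%) 21% 11% 0% 52% 9% 9% 5637 50%",
    "99% 92% 80% 55% 84% 86% 90% 91% 69% 31% 0% 0% 29% 58% 0% 16% 103%( 66%) 3% 11% 0% 67% 9% 8% 5962 100%",
]


def parse_row(line: str) -> dict:
    """Map one output row onto the column names."""
    # Join "105%( 65%)" into a single token so a whitespace split gives 25 fields.
    tokens = re.sub(r"\(\s+", "(", line).split()
    if len(tokens) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(tokens)}")
    # Keep only the leading number of each token, e.g. "105%(65%)" -> 105.
    return {name: float(re.match(r"\d+", token).group())
            for name, token in zip(COLUMNS, tokens)}


def domain_averages(rows: list) -> dict:
    """Average each domain column over all parsed rows."""
    parsed = [parse_row(row) for row in rows]
    return {domain: sum(p[domain] for p in parsed) / len(parsed)
            for domain in DOMAINS}


averages = domain_averages(SAMPLE_ROWS)
busiest = max(averages, key=averages.get)
print(f"busiest domain: {busiest} at {averages[busiest]:.0f}% on average")
```

Feeding in all of the posted rows instead of three gives a steadier picture, but the ranking is what matters more than the exact figures.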
I am not actually sure that is an issue...but I was told it is by IBM N series support. I do have an open case currently escalated from IBM to NetApp regarding the same issue with stale metadata. IBM told me to run sis start -s once, and then sis start manually twice on each volume. (They specifically said that sis start with no switches had to be run twice on each volume, and that a scheduled run would not suffice.) I still have over 100% stale on several volumes.
Maybe it was just a way to keep me busy for a week. "um yeah, go dedup all your volumes 3 times and come back if it doesn't help"
More info here, but it sounds like you already got that. https://library.netapp.com/ecmdocs/ECMP1368838/html/GUID-5B6B2A2E-FAFD-4A92-...
Perhaps post your wafltop too and someone might be able to point something out in that.
--Jordan
Thanks
Happy to run wafltop and dump the output for comment. Just the standard volume and process views?
How long is it a good idea to leave it running to get useful output?
William
I suppose it depends, but if I had to throw out a number, 10 minutes would probably be a good start in most situations.
--Jordan
From: Will.Burchell@skanska.co.uk [mailto:Will.Burchell@skanska.co.uk] Sent: Monday, June 16, 2014 4:38 PM To: Jordan Slingerland; marcus.nilsson@atea.se; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Thanks
Happy to run wafltop and dump the ouput for comment. Just the standard volume and process?
How long is it a good idea to leave it running to get a useful output
William
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 21:24 To: Burchell, Will (ITSD); marcus.nilsson@atea.semailto:marcus.nilsson@atea.se; Toasters@teaparty.netmailto:Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
I am not actually sure that is an issue...but I was told it is by IBM n-series support. I do have an open case currently escalated from IBM to Netapp regarding the same issue with stale metadata. IBM told me to run sis start -s once, and then sis start manually 2x on each volume.(they specifically said sis start (no switches) volume ) had to be run 2x on each volume and a scheduled run would not suffice) Still have some over 100% stale on several volumes.
Maybe it was just a way to keep me busy for a week. "um yeah, go dedup all your volumes 3 times and come back if it does't help"
More info here, but it sounds like you already got that. https://library.netapp.com/ecmdocs/ECMP1368838/html/GUID-5B6B2A2E-FAFD-4A92-...
Perhaps post your wafltop too and someone might be able to point something out in that.
--Jordan
From: Will.Burchell@skanska.co.ukmailto:Will.Burchell@skanska.co.uk [mailto:Will.Burchell@skanska.co.uk] Sent: Monday, June 16, 2014 3:56 PM To: Jordan Slingerland; marcus.nilsson@atea.semailto:marcus.nilsson@atea.se; Toasters@teaparty.netmailto:Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
No deduplications are running
So I have run sis status -l and can confirm the following SIS jobs and their stale fingerprints. This looks pretty bad. What are we doing wrong here?
We upgraded through various versions of ONTAP last year and believe we ran into a SIS issue, but thought it had been cleared by running the sis start -S command to clean out the stale fingerprints
We are on 8.1.3P2 and came from 8.0.x into many versions of 8.1.x over the last 18 months
Will
8% 1% 79% 14% 0% 52% 130% 108% 126% 117% 181% 112% 121% 81% 7% 0% 0% 0% 0% 25% 0% 26%
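For anyone wanting to check a list like this against the 20% stale-fingerprint guideline mentioned in this thread, here is a minimal sketch. It assumes the percentages are pasted as one whitespace-separated string (the volume names were not included in the post), and the helper name over_threshold is hypothetical.

```python
STALE_THRESHOLD_PCT = 20  # guideline quoted in this thread, not an official limit

# Stale-fingerprint percentages exactly as posted (volume names not included)
posted = (
    "8% 1% 79% 14% 0% 52% 130% 108% 126% 117% 181% "
    "112% 121% 81% 7% 0% 0% 0% 0% 25% 0% 26%"
)

def over_threshold(text, threshold=STALE_THRESHOLD_PCT):
    """Return the percentages in text that exceed threshold."""
    values = [int(token.rstrip("%")) for token in text.split()]
    return [value for value in values if value > threshold]

flagged = over_threshold(posted)
print(f"{len(flagged)} of {len(posted.split())} volumes exceed {STALE_THRESHOLD_PCT}%")
print("worst:", max(flagged), "%")
```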
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 20:47 To: Marcus Nilsson; Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
It sounds like you ruled out the obvious, but I will say it anyway: no deduplications are running, right?
And not so obvious: if none are running, look at sis status -l and check whether any of the volumes are over 20% in the Stale Fingerprints: column.
--Jordan
From: Marcus Nilsson [mailto:marcus.nilsson@atea.se] Sent: Monday, June 16, 2014 3:44 PM To: Will.Burchell@skanska.co.uk; Jordan Slingerland; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Hi, Might be worth checking out the article at this link http://www.jk-47.com/2014/02/attack-of-old-bugs-netapp-high-cpu/
We ran into this exact issue after upgrading a system from 8.0.3P2 to 8.1.4P1. A process looping in wafl scan blk_reclaim.
BR Marcus
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Will.Burchell@skanska.co.uk Sent: 16 June 2014 21:19 To: Jordan.Slingerland@independenthealth.com; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
I am checking all CPUs and they are pretty busy
We are in the UK so it's out of hours (and our nightly processes are mostly stopped right now)
We have the issue I mentioned where our Exchange LUNs are on the same aggregate together and we have a high IO workload with 6000 mailboxes.
This is the sysstat -M 1 right now as an example. It seems high considering there is no de-dupe and only a single snapmirror running (to do a vol move for our Exchange separation problem)
If you have any other thoughts, I am most interested
William
ANY1+ ANY2+ ANY3+ ANY4+   AVG  CPU0  CPU1  CPU2  CPU3 Network Protocol Cluster Storage  Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt  Cifs Exempt  Intr  Host Ops/s    CP
  92%   76%   56%   32%   68%   67%   73%   75%   59%     47%       0%      0%     22%   26%     0%    12%    105%( 65%)          0%        5%    0%    34%    9%   13%  9152    0%
  98%   90%   80%   60%   85%   87%   89%   88%   75%     31%       0%      0%     26%   42%     0%    15%    122%( 72%)         21%       11%    0%    52%    9%    9%  5637   50%
  99%   92%   80%   55%   84%   86%   90%   91%   69%     31%       0%      0%     29%   58%     0%    16%    103%( 66%)          3%       11%    0%    67%    9%    8%  5962  100%
  98%   91%   79%   54%   84%   84%   88%   90%   72%     45%       0%      0%     27%   45%     0%     9%    122%( 78%)          0%       12%    0%    54%    8%   10%  7222  100%
  99%   94%   88%   67%   89%   91%   94%   94%   77%     25%       0%      0%     28%   63%     0%    21%    100%( 65%)         25%        9%    0%    71%    8%    7%  4452  100%
  97%   91%   79%   52%   83%   84%   89%   91%   67%     39%       0%      0%     30%   51%     0%     8%    113%( 73%)          0%       10%    0%    62%    8%    9%  8253  100%
  98%   87%   71%   44%   78%   79%   84%   83%   67%     46%       0%      0%     25%   33%     0%    14%    121%( 74%)          0%        9%    0%    42%   12%   11%  9237   66%
  97%   93%   86%   65%   88%   87%   92%   93%   80%     29%       0%      0%     27%   50%     0%    22%    116%( 69%)         24%        9%    0%    59%    9%    8%  5213   63%
  97%   85%   69%   42%   76%   77%   83%   85%   60%     37%       0%      0%     28%   42%     0%     9%    109%( 69%)          1%       10%    0%    48%   11%   10%  6795  100%
  98%   91%   77%   50%   82%   83%   88%   91%   66%     39%       0%      0%     30%   50%     0%     8%    116%( 73%)          0%       10%    0%    58%    7%   10%  6993  100%
  98%   92%   82%   62%   86%   85%   90%   91%   78%     29%       0%      0%     28%   51%     0%    20%    108%( 69%)         21%       14%    0%    59%    6%    8%  5308   90%
 100%   97%   91%   65%   90%   92%   94%   96%   80%     30%       0%      0%     30%   59%     0%    16%    120%( 76%)          3%       20%    0%    68%    9%    7%  5593  100%
  98%   85%   70%   47%   78%   76%   82%   81%   71%     33%       0%      0%     26%   41%     0%    16%    110%( 70%)          4%       12%    0%    48%   10%   10%  5907   79%
 100%   98%   89%   61%   89%   91%   94%   96%   75%     28%       0%      0%     32%   62%     0%    20%     98%( 65%)         17%       10%    0%    73%    8%    7%  5290  100%
  98%   91%   77%   50%   82%   80%   85%   89%   72%     33%       0%      0%     30%   48%     0%    21%    108%( 64%)          0%       12%    0%    59%    6%    9%  6047  100%
  99%   91%   75%   49%   82%   80%   84%   85%   77%     36%       0%      0%     26%   29%     0%    15%    144%( 80%)          1%       12%    0%    44%   10%   10%  6412   67%
 100%   95%   88%   68%   90%   88%   94%   97%   80%     26%       0%      0%     29%   59%     0%    26%    100%( 66%)         23%       16%    0%    65%    8%    7%  4602  100%
  98%   87%   74%   48%   79%   78%   86%   90%   63%     30%       0%      0%     29%   52%     0%     9%    105%( 68%)          0%       14%    0%    60%    9%    8%  5533  100%
  98%   88%   77%   58%   83%   81%   87%   90%   73%     30%       0%      0%     27%   47%     0%    19%    106%( 66%)         21%       10%    0%    54%    9%    8%  5691   98%
ANY1+ ANY2+ ANY3+ ANY4+   AVG  CPU0  CPU1  CPU2  CPU3 Network Protocol Cluster Storage  Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt  Cifs Exempt  Intr  Host Ops/s    CP
  97%   86%   70%   43%   77%   77%   84%   87%   61%     39%       0%      0%     27%   39%     0%     7%    116%( 73%)          0%       11%    0%    49%   11%   10%  7526  100%
  97%   86%   70%   44%   78%   80%   85%   87%   61%     34%       0%      0%     28%   44%     0%     9%    108%( 68%)          0%       14%    0%    53%   11%   13%  6308  100%
  98%   87%   77%   59%   82%   80%   86%   88%   76%     28%       0%      0%     24%   44%     0%    23%    106%( 66%)         21%       14%    0%    53%    9%    8%  5200   82%
 100%   96%   86%   57%   87%   88%   92%   95%   73%     30%       0%      0%     30%   56%     0%    18%    111%( 69%)          3%       18%    0%    68%    6%    8%  5163  100%
  98%   90%   78%   55%   83%   82%   88%   91%   69%     32%       0%      0%     28%   44%     0%    11%    119%( 74%)          6%       19%    0%    54%    9%    9%  6148   99%
 100%   97%   89%   64%   89%   91%   93%   96%   75%     34%       0%      0%     30%   62%     0%    22%     99%( 65%)         17%        9%    0%    70%    6%    8%  6496  100%
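To compare the ANY1+ figure with the per-CPU numbers in output like this, a minimal sketch follows. It uses only the first nine fields (ANY1+ through CPU3) of the first three samples posted above; the helper name parse_sample is hypothetical and the parsing assumes whitespace-separated percentages.

```python
# First nine fields (ANY1+ .. CPU3) of three samples copied from the posted output.
SAMPLES = [
    "92% 76% 56% 32% 68% 67% 73% 75% 59%",
    "98% 90% 80% 60% 85% 87% 89% 88% 75%",
    "99% 92% 80% 55% 84% 86% 90% 91% 69%",
]
FIELDS = ["ANY1+", "ANY2+", "ANY3+", "ANY4+", "AVG", "CPU0", "CPU1", "CPU2", "CPU3"]

def parse_sample(line):
    """Map the leading sysstat fields of one sample to integer percentages."""
    values = [int(token.rstrip("%")) for token in line.split()]
    return dict(zip(FIELDS, values))

rows = [parse_sample(sample) for sample in SAMPLES]
for row in rows:
    per_cpu = [row[f"CPU{i}"] for i in range(4)]
    mean_cpu = sum(per_cpu) / len(per_cpu)
    print(f"ANY1+ {row['ANY1+']}%  AVG {row['AVG']}%  per-CPU mean {mean_cpu:.1f}%")
```

In these samples ANY1+ sits well above AVG, which is why the per-CPU and AVG columns are the more useful ones to watch.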
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 20:14 To: Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Even if it is 10k ops after 5 minutes...that is only 33 ops per second. I doubt 33 unaligned ops per second is your cpu issue.
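The arithmetic behind that figure, as a minimal sketch: the counter delta observed since nfsstat -z divided by the elapsed seconds. The function name ops_per_second is hypothetical, and the 10k and 5 minute inputs are the round numbers used above, not measured values.

```python
def ops_per_second(counter_delta: int, interval_seconds: float) -> float:
    """Average rate for a counter that grew by counter_delta over interval_seconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_delta / interval_seconds

# 10k unaligned ops counted 5 minutes after the counters were zeroed
rate = ops_per_second(10_000, 5 * 60)
print(f"{rate:.1f} unaligned ops/s")
```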
Maybe you can fix that one top talker just to show support that it is not the issue? ...depending on how critical that one system is, that may or may not be worth fighting with support over.
Now, on to the CPU issue. You are using "sysstat -m 1" to look at all CPUs and not only the "ANY" CPU metric, right?
If you run, for example, "sysstat -x 1", you are looking at the % of time that ANY of your CPUs is busy. Seems to me this metric is nearly completely useless.
--Jordan
From: Will.Burchell@skanska.co.uk [mailto:Will.Burchell@skanska.co.uk] Sent: Monday, June 16, 2014 3:07 PM To: Jordan Slingerland; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Thanks
I reset with the -z switch
I then ran -d again 5 minutes later. Many of the counters are in the tens, so I am happy with this. However, 1 server is in the thousands already. This is a Windows 2000 server (don't ask please!) which has a misaligned C drive, but I have used the "functional aligned" datastore in VSC to get around this. I assume nfsstat -d won't understand that, hence the counters in the thousands
William
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 19:57 To: Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
First off, make sure the values in nfsstat -d are actually incrementing significantly by running nfsstat -z to clear the counters, then waiting a while and looking at nfsstat -d again.
You may find that you are only doing a handful of unaligned ops and not hundreds or thousands per second.
--Jordan
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Will.Burchell@skanska.co.uk Sent: Monday, June 16, 2014 2:50 PM To: Toasters@teaparty.net Subject: High CPU VM misalignment confusion
Hello. I am hoping you can guide me in the right direction
We have been experiencing very high CPU load on a 7-mode HA pair of 3270 controllers running 8.1.3P2
We have worked with netapp support on these issues and they note our workload is very high on one controller (where we run our VMware setup from)
We also have so called "bad practice" where we are running our exchange ISCSI LUNs on SATA with logs and dbs on the same aggregate (currently separating this out as I type)
I have been told by support we have VMDK misalignment, however I spent a long time a few months ago resolving this firstly by using the VSC tool to confirm the problem and then fixing it with a combination of MBRALIGN and VMware converter as a V2V process
The support guy tells me he sees misalignment when he runs nfsstat -d but MBRSCAN shows these are aligned. What is going on here?
Trying to reduce our CPU and IO burden but getting conflicting information.
Finally I think we should look to upgrade to 8.1.4P2 to remove some bugs? We would consider 8.2.x but I don't think we can as we run Exchange 2010 (using SME 6.x etc)
Thanks in advance
William
Really appreciate any input on this one. It may help to note we are running a snapmirror to the aggregate named aggrsata2
Again, this is night time here and all our useful snapmirrors, SIS and backups are currently disabled while we troubleshoot
Regards
William
--------------------------------
CPU Utilization Percent
Application                             Total   STRIPE  VOL_LOG  VOL_VBN      VBN      VOL AGGR_VBN     AGGR   SERIAL XCleaner
-----------                          -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
other:other:other:                         19       11        0        3        0        0        0        0        5        0
aggrsata2::other:                          12        0        0       12        0        0        0        0        0        0
aggrvm:L_NFS_VOL0183:nfsv3:                12       12        0        0        0        0        0        0        0        0
aggrsata1:L_LUN_VOL0199:other:             11        0        0        5        0        0        0        4        2        0
aggrsata2::walloc:                          5        0        0        0        0        0        0        0        0        5
aggrsata2:SV_L_LUN_VOL0199:scanner:         3        0        0        0        0        0        0        0        3        0
aggrvm:L_NFS_VOL0179:nfsv3:                 2        2        0        0        0        0        0        0        0        0
aggrvm:L_NFS_VOL0176:nfsv3:                 2        2        0        0        0        0        0        0        0        0
aggrvm::walloc:                             1        0        0        0        0        0        1        0        0        0
aggrvm:L_NFS_VOL0224:nfsv3:                 1        1        0        0        0        0        0        0        0        0
aggrvm:L_NFS_VOL0178:nfsv3:                 1        1        0        0        0        0        0        0        0        0

CPU Time us
Application                             Total   STRIPE  VOL_LOG  VOL_VBN      VBN      VOL AGGR_VBN     AGGR   SERIAL XCleaner
-----------                          -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
aggrvm:L_NFS_VOL0224:nfsv3:              2097     2030       63        4        0        0        0        0        0        0
aggrvm:L_NFS_VOL0158:file i/o:           1418        0        0        0        0        0        0        0     1418        0
aggrvm:L_NFS_VOL0181:file i/o:           1100        0        0        0        0        0        0        0     1100        0
aggrvm:L_NFS_VOL0176:file i/o:            763        0        0        0        0        0        0        0      763        0
aggrvm:L_NFS_VOL0177:file i/o:            696        0        0        0        0        0        0        0      696        0
aggrvm:L_NFS_VOL0183:nfsv3:               693      678       13        2        0        0        0        0        0        0
aggrvm:L_LUN_VOL0152:iscsi:               660      643       14        2        0        0        1        0        0        0
aggrvm:L_NFS_VOL0183:file i/o:            640        0        0        0        0        0        0        0      640        0
aggrvm:L_NFS_VOL0179:file i/o:            599        0        0        0        0        0        0        0      599        0
aggrvm:L_NFS_VOL0180:file i/o:            530        0        0        0        0        0        0        0      530        0
aggr0:vol0:file i/o:                      516      212        0        0        0        0        0        0      304        0
aggrsata2:L_NFS_VOL0202:nfsv3:            482      477        0        1        0        0        0        0        4        0
aggr0:vol0:spinvfs:                       443      400        0        0        0        0        0        0       43        0
aggrsata1:L_LUN_VOL0200:iscsi:            387      281       47       59        0        0        0        0        0        0
aggrvm:P_NFS_WPF0160:nfsv3:               381      370        5        6        0        0        0        0        0        0
aggrvm:L_NFS_VOL0172:nfsv3:               328      311        0        0        0        0        0        0       17        0
aggrvm:L_NFS_VOL0179:nfsv3:               312      302        6        4        0        0        0        0        0        0
aggrvm:L_NFS_VOL0174:nfsv3:               278      270        4        4        0        0        0        0        0        0
aggrsata1:L_LUN_VOL0201:iscsi:            258      249        8        1        0        0        0        0        0        0
aggrvm:P_NFS_VSW0159:nfsv3:               245      245        0        0        0        0        0        0        0        0

Latency
Application                          Latency ms
-----------                          ----------
aggrvm:L_NFS_VOL0224:nfsv3:               1.956
aggrsata1:L_LUN_VOL0200:iscsi:            1.833
aggrvm:P_NFS_WPF0160:nfsv3:               1.161
aggrvm:L_NFS_VOL0176:nfsv4:               1.000
aggrvm:L_NFS_VOL0183:nfsv3:               0.972
aggrsata1:L_LUN_VOL0201:iscsi:            0.897
aggrvm:L_NFS_VOL0182:nfsv3:               0.877
aggrvm:L_NFS_VOL0179:nfsv3:               0.742
aggrvm:P_NFS_VSW0159:nfsv3:               0.673
aggrvm:L_NFS_VOL0176:nfsv3:               0.626
aggrvm:L_NFS_VOL0180:nfsv3:               0.623
aggrvm:L_NFS_VOL0174:nfsv3:               0.560
aggr0:vol0:spinvfs:                       0.545
aggrvm:L_NFS_VOL0181:nfsv3:               0.530
aggrvm:L_NFS_VOL0178:nfsv3:               0.523
aggrvm:L_NFS_VOL0158:nfsv3:               0.522
aggrvm:L_LUN_VOL0152:iscsi:               0.518
aggrvm:L_NFS_VOL0177:nfsv3:               0.518
aggrvm:L_NFS_VOL0161:nfsv3:               0.380
aggrsata2:L_NFS_VOL0202:nfsv3:            0.351

Application                          System Latency ms
-----------                          -----------------
aggrvm:L_NFS_VOL0158:file i/o:                   5.000
aggrvm:L_NFS_VOL0183:file i/o:                   2.454
aggrvm:L_NFS_VOL0177:file i/o:                   2.388
aggrvm:L_NFS_VOL0176:file i/o:                   2.387
aggrvm:L_NFS_VOL0180:file i/o:                   2.300
aggrvm:L_NFS_VOL0179:file i/o:                   2.238
aggrvm:L_NFS_VOL0181:file i/o:                   2.000
aggr0:vol0:file i/o:                             0.217

I/O utilization
                                              ---------MB Read---------- ---------MB Write--------- --------IOs Read---------- --------IOs Write---------
Application                          MB Total Standard      PAM   Hybrid Standard      PAM   Hybrid Standard      PAM   Hybrid Standard      PAM   Hybrid
-----------                          -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
aggrsata1:L_LUN_VOL0199:other:          55754    55625      104        0       25        0        0  1580931    15219        0        0        0        0
aggrsata2::other:                       54671       33       16        0    54622        0        0     8193     3909        0     5118        0        0
aggrvm:L_NFS_VOL0183:nfsv3:             50494    45477     1198        0     3819        0        0  1863968   118191        0        0        0        0
aggrvm:L_NFS_VOL0224:nfsv3:              8030     7064      900        0       66        0        0   266065    53707        0        0        0        0
other:other:other:                       6539      231      146        0        1     6161        0    40669    36514        0    22879    24646        0
aggrvm:L_NFS_VOL0179:nfsv3:              3927      698      105        0     3124        0        0    90166    15612        0        0        0        0
aggrvm:L_NFS_VOL0176:nfsv3:              3693     1042      787        0     1864        0        0   114329    96367        0        0        0        0
aggrvm:L_NFS_VOL0178:nfsv3:              1353      186       63        0     1104        0        0    31507    13330        0        0        0        0
aggrvm::walloc:                          1328      109      211        0     1008        0        0    27837    54005        0     1566        0        0
aggrvm:L_NFS_VOL0180:nfsv3:              1193       73       24        0     1096        0        0    18293     5866        0        0        0        0
aggrvm:L_NFS_VOL0177:nfsv3:              1042      142       12        0      888        0        0    36305     2985        0        0        0        0
aggrvm:L_NFS_VOL0181:nfsv3:               952       94       37        0      821        0        0    23163     5004        0        0        0        0
aggrsata2:SV_L_LUN_VOL0199:scanner:       926       76      838        0       12        0        0     2080   126410        0        0        0        0
aggrvm:L_NFS_VOL0174:nfsv3:               510      375       11        0      124        0        0    19868     1742        0        0        0        0
aggrvm:L_NFS_VOL0161:nfsv3:               434       22        6        0      406        0        0     5487     1646        0        0        0        0
aggrvm:L_NFS_VOL0176:walloc:              405       28       46        0      331        0        0     7068    11782        0     1752        0        0
aggrvm:L_NFS_VOL0183:walloc:              354       64       79        0      211        0        0    16521    20139        0     2468        0        0
aggrvm::other:                            352       29       83        0      240        0        0     6688    18089        0     1502        0        0
aggrsata2::walloc:                        318       23       18        0      277        0        0     5912     4744        0    10580        0        0
aggrvm:L_NFS_VOL0179:walloc:              313       79       86        0      148        0        0    20176    22026        0     1975        0        0

NVLog Utilization
Application                          NVLog in KB/s
-----------                          ------------------
aggrsata2::other:                                102598
aggrvm:L_NFS_VOL0183:nfsv3:                        6890
aggrvm:L_NFS_VOL0179:nfsv3:                        5646
aggrvm:L_NFS_VOL0176:nfsv3:                        2594
aggrvm:L_NFS_VOL0178:nfsv3:                        1716
aggrvm:L_NFS_VOL0180:nfsv3:                        1633
aggrvm:L_NFS_VOL0177:nfsv3:                        1315
aggrvm:L_NFS_VOL0181:nfsv3:                        1117
aggrvm:L_NFS_VOL0161:nfsv3:                         716
aggrvm:L_NFS_VOL0182:nfsv3:                         451
aggrvm:P_NFS_WPF0160:nfsv3:                         307
aggrvm:L_NFS_VOL0158:nfsv3:                         189
aggrvm:L_NFS_VOL0174:nfsv3:                         149
aggrvm:L_NFS_VOL0224:nfsv3:                          60
aggrvm:L_NFS_VOL0183:file i/o:                       50
aggrvm:L_NFS_VOL0179:file i/o:                       44
aggrvm:L_NFS_VOL0176:file i/o:                       19
aggr0:vol0:file i/o:                                 13
aggrvm:L_NFS_VOL0180:file i/o:                       12
aggrsata1:L_LUN_VOL0201:iscsi:                        9

Application                          NVLog_b2b in KB/s
-----------                          ----------------------
One final thing I notice is that snapshots are taking some time to delete, which I know is an intensive operation in Data ONTAP
Example (at the bottom)
snapid status   date           ownblks release fsRev name
------ ------   ------------   ------- ------- ----- --------
    61 complete Jun 16 15:49    334938 8.1     22331 2014-06-16_1548+0100_daily
    60 complete Jun 16 15:48       765 8.1     22331 uk-su-ss007(1575077051)_SV_L_LUN_VOL0199-base.788
    44 complete Jun 16 14:08   5415287 8.1     22331 2014-06-16_1408+0100_daily
    37 complete Jun 16 10:45  11877364 8.1     22331 2014-06-16_1044+0100_daily
    12 complete Jun 15 23:38  16471069 8.1     22331 2014-06-15_2338+0100_daily
   250 complete Jun 15 23:36      6949 8.1     22331 2014-06-15_2334+0100_daily
    90 complete Jun 15 23:30     16455 8.1     22331 uk-su-ss009(1575007753)_SV_M_LUN_VOL0199.120
    58 complete Jun 15 23:29      1044 8.1     22331 2014-06-15_2328+0100_daily
   226 complete Jun 15 23:23     40544 8.1     22331 2014-06-15_2322+0100_daily
   216 complete Jun 15 15:10   4498948 8.1     22331 2014-06-15_1509+0100_daily
   208 complete Jun 15 12:09   2605999 8.1     22331 2014-06-15_1208+0100_daily
   196 complete Jun 15 09:24   2187941 8.1     22331 2014-06-15_0924+0100_daily
   184 complete Jun 15 01:20   6696250 8.1     22331 2014-06-15_0120+0100_daily
   176 complete Jun 15 01:18      5461 8.1     22331 2014-06-15_0117+0100_daily
   161 complete Jun 15 01:14     32047 8.1     22331 2014-06-15_0113+0100_daily
   148 complete Jun 15 01:11     56057 8.1     22331 2014-06-15_0111+0100_daily
   124 complete Jun 14 15:09   4817241 8.1     22331 2014-06-14_1508+0100_daily
   112 complete Jun 14 12:10   3349333 8.1     22331 2014-06-14_1210+0100_daily
    94 complete Jun 14 10:03   3527119 8.1     22331 2014-06-14_1002+0100_daily
    75 complete Jun 13 22:31  20042240 8.1     22331 2014-06-13_2229+0100_daily
    69 complete Jun 13 22:27      4954 8.1     22331 2014-06-13_2226+0100_daily
    52 complete Jun 13 21:36    343477 8.1     22331 2014-06-13_2135+0100_daily
    41 complete Jun 13 20:54   8632101 8.1     22331 2014-06-13_2054+0100_daily
    33 complete Jun 13 20:35   4298937 8.1     22331 2014-06-13_2034+0100_daily
    19 complete Jun 13 20:32    142680 8.1     22331 2014-06-13_2031+0100_daily
     1 complete Jun 13 11:09   6691044 8.1     22331 2014-06-13_1108+0100_daily
   238 complete Jun 12 21:46  19191567 8.1     22331 2014-06-12_2145+0100_daily
   224 complete Jun 12 21:05   9997690 8.1     22331 2014-06-12_2104+0100_daily
   126 complete Jun 08 23:02 109922787 8.1     22331 2014-06-08_2301+0100_daily
   130 complete Jun 08 22:58      4925 8.1     22331 2014-06-08_2257+0100_daily
   179 complete Jun 08 22:52     21545 8.1     22331 2014-06-08_2251+0100_daily
   167 complete Jun 08 22:44     41255 8.1     22331 2014-06-08_2243+0100_daily
    85 complete Jun 08 01:05  11868850 8.1     22331 2014-06-08_0104+0100_daily
    63 complete Jun 08 01:00      5647 8.1     22331 2014-06-08_0059+0100_daily
   144 complete Jun 08 00:55     32549 8.1     22331 2014-06-08_0054+0100_daily
    99 complete Jun 08 00:47     38928 8.1     22331 2014-06-08_0045+0100_daily
   152 complete Jun 07 00:58  18053974 8.1     22331 2014-06-07_0057+0100_daily
   228 complete Jun 07 00:54      4288 8.1     22331 2014-06-07_0052+0100_daily
    54 complete Jun 07 00:46     61614 8.1     22331 2014-06-07_0045+0100_daily
    15 complete Jun 07 00:39     73153 8.1     22331 2014-06-07_0038+0100_daily
   139 complete Jun 06 01:38  34035727 8.1     22331 2014-06-06_0137+0100_daily
    49 complete Jun 06 01:35      5386 8.1     22331 2014-06-06_0134+0100_daily
   241 complete Jun 06 01:31     43381 8.1     22331 2014-06-06_0131+0100_daily
    34 complete Jun 06 01:28     60157 8.1     22331 2014-06-06_0127+0100_daily
    25 complete Jun 05 01:57  32400189 8.1     22331 2014-06-05_0156+0100_daily
    77 complete Jun 05 01:55      4076 8.1     22331 2014-06-05_0154+0100_daily
   200 complete Jun 05 01:51     75258 8.1     22331 2014-06-05_0151+0100_daily
   132 complete Jun 05 01:48     85317 8.1     22331 2014-06-05_0147+0100_daily
     9 complete Jun 04 00:00  39722504 8.1     22331 2014-06-03_2359+0100_daily
   198 complete Jun 03 23:57      5421 8.1     22331 2014-06-03_2355+0100_daily
   253 complete Jun 03 23:51     71954 8.1     22331 2014-06-03_2350+0100_daily
    13 complete Jun 03 23:45    102485 8.1     22331 2014-06-03_2343+0100_daily
   146 complete Jun 03 15:55   9263616 8.1     22331 2014-06-03_1555+0100_daily
   128 complete Jun 03 15:14   4950630 8.1     22331 2014-06-03_1514+0100_daily
    97 complete Jun 03 00:46  22248178 8.1     22331 2014-06-03_0045+0100_daily
   166 complete Jun 03 00:43      4495 8.1     22331 2014-06-03_0042+0100_daily
   104 complete Jun 03 00:37     68909 8.1     22331 2014-06-03_0036+0100_daily
    70 complete Jun 03 00:31    145138 8.1     22331 2014-06-03_0030+0100_daily
   254 complete Jun 02 21:20  35346082 8.1     22331 2014-06-02_2120+0100_daily
   202 complete Jun 02 21:19      4114 8.1     22331 2014-06-02_2118+0100_daily
    53 complete Jun 02 05:09    528753 8.1     22331 2014-06-02_0509+0100_daily
   249 complete Jun 02 05:06     50924 8.1     22331 2014-06-02_0506+0100_daily
    50 complete May 28 01:54  64891840 8.1     22331 2014-05-28_0154+0100_daily
     5 complete May 28 01:51      6479 8.1     22331 2014-05-28_0150+0100_daily
   151 complete May 28 01:49     70924 8.1     22331 2014-05-28_0148+0100_daily
    28 complete May 28 01:46     79695 8.1     22331 2014-05-28_0146+0100_daily
   129 complete May 27 15:55  10422515 8.1     22331 2014-05-27_1554+0100_daily
    16 complete May 27 13:02   9773150 8.1     22331 2014-05-27_1301+0100_daily
   157 complete May 27 10:27  10143739 8.1     22331 2014-05-27_1027+0100_daily
   105 complete May 26 22:44  14249999 8.1     22331 2014-05-26_2243+0100_daily
   252 complete May 26 22:41      7559 8.1     22331 2014-05-26_2240+0100_daily
   160 complete May 26 22:37     27078 8.1     22331 2014-05-26_2236+0100_daily
    23 complete May 26 22:31     44864 8.1     22331 2014-05-26_2230+0100_daily
   232 complete May 26 15:37   4273421 8.1     22331 2014-05-26_1537+0100_daily
   136 complete May 26 12:40   2211938 8.1     22331 2014-05-26_1240+0100_daily
    91 complete May 26 09:47   3502863 8.1     22331 2014-05-26_0947+0100_daily
    21 complete May 25 22:29   4984748 8.1     22331 2014-05-25_2228+0100_daily
     4 complete May 25 22:26      6579 8.1     22331 2014-05-25_2225+0100_daily
   133 complete May 25 22:21     19720 8.1     22331 2014-05-25_2220+0100_daily
    51 complete May 25 22:17     29593 8.1     22331 2014-05-25_2216+0100_daily
   194 complete May 25 15:38   2568742 8.1     22331 2014-05-25_1537+0100_daily
   127 complete May 25 12:37   2007218 8.1     22331 2014-05-25_1237+0100_daily
    30 complete May 25 09:45   1702320 8.1     22331 2014-05-25_0944+0100_daily
   172 complete May 24 22:19   4880430 8.1     22331 2014-05-24_2218+0100_daily
   143 complete May 24 22:16      4764 8.1     22331 2014-05-24_2215+0100_daily
   171 complete May 24 22:08     26268 8.1     22331 2014-05-24_2207+0100_daily
   115 complete May 24 22:02     37012 8.1     22331 2014-05-24_2201+0100_daily
     3 complete May 24 15:37   3208176 8.1     22331 2014-05-24_1536+0100_daily
   163 complete May 24 12:40   2092982 8.1     22331 2014-05-24_1239+0100_daily
   246 complete May 24 09:54   2258149 8.1     22331 2014-05-24_0954+0100_daily
   164 complete May 24 00:44   7307421 8.1     22331 2014-05-24_0042+0100_daily
   173 complete May 24 00:39      8171 8.1     22331 2014-05-24_0037+0100_daily
    65 complete May 24 00:34     63931 8.1     22331 2014-05-24_0034+0100_daily
   188 complete May 24 00:27     58960 8.1     22331 2014-05-24_0026+0100_daily
   153 complete May 23 16:02   8251951 8.1     22331 2014-05-23_1602+0100_daily
   110 complete May 23 13:00   8945685 8.1     22331 2014-05-23_1259+0100_daily
    87 complete May 23 10:32   9701026 8.1     22331 2014-05-23_1031+0100_daily
    14 complete May 23 01:19  17123971 8.1     22331 2014-05-23_0118+0100_daily
   235 complete May 23 01:17     10509 8.1     22331 2014-05-23_0117+0100_daily
   111 complete May 23 01:15     13058 8.1     22331 2014-05-23_0115+0100_daily
    46 complete May 23 01:13    239438 8.1     22331 2014-05-23_0112+0100_daily
   230 complete May 22 02:20  39300529 8.1     22331 2014-05-22_0220+0100_daily
    89 complete May 22 02:18      5642 8.1     22331 2014-05-22_0217+0100_daily
    32 complete May 22 02:14     41995 8.1     22331 2014-05-22_0213+0100_daily
    22 complete May 22 02:11    174295 8.1     22331 2014-05-22_0210+0100_daily
    93 complete May 21 01:36  42877364 8.1     22331 2014-05-21_0136+0100_daily
   255 complete May 21 01:33      7137 8.1     22331 2014-05-21_0132+0100_daily
   165 complete May 21 01:29     72219 8.1     22331 2014-05-21_0128+0100_daily
    95 complete May 21 01:25     97763 8.1     22331 2014-05-21_0124+0100_daily
   212 complete May 20 02:24  36695419 8.1     22331 2014-05-20_0223+0100_daily
   177 complete May 20 02:22      2988 8.1     22331 2014-05-20_0222+0100_daily
   131 complete May 20 02:18     42998 8.1     22331 2014-05-20_0218+0100_daily
   227 complete May 20 02:15     78827 8.1     22331 2014-05-20_0214+0100_daily
   155 complete May 18 23:07  39970911 8.1     22331 2014-05-18_2306+0100_daily
   123 complete May 18 23:04      5246 8.1     22331 2014-05-18_2302+0100_daily
   102 complete May 18 22:57     16970 8.1     22331 2014-05-18_2257+0100_daily
    81 complete May 18 22:53     31928 8.1     22331 2014-05-18_2252+0100_daily
   251 complete May 17 23:34  14464864 8.1     22331 2014-05-17_2333+0100_daily
   147 complete May 17 23:30      8776 8.1     22331 2014-05-17_2328+0100_daily
    45 complete May 17 23:25     16371 8.1     22331 2014-05-17_2324+0100_daily
   223 complete May 17 23:19     36615 8.1     22331 2014-05-17_2318+0100_daily
    98 complete May 16 23:21  20601283 8.1     22331 2014-05-16_2320+0100_daily
    66 complete May 16 23:17      6112 8.1     22331 2014-05-16_2316+0100_daily
    27 complete May 16 23:10     24164 8.1     22331 2014-05-16_2310+0100_daily
    17 complete May 16 23:04    102933 8.1     22331 2014-05-16_2302+0100_daily
   225 complete May 15 23:15  34033701 8.1     22331 2014-05-15_2314+0100_daily
   187 complete May 15 23:12      5472 8.1     22331 2014-05-15_2311+0100_daily
    96 complete May 15 23:07     22786 8.1     22331 2014-05-15_2306+0100_daily
    10 complete May 15 23:01     71324 8.1     22331 2014-05-15_2300+0100_daily
   191 complete May 14 23:28  33533416 8.1     22331 2014-05-14_2327+0100_daily
    80 complete May 14 23:23      6759 8.1     22331 2014-05-14_2322+0100_daily
   231 complete May 14 23:17     25124 8.1     22331 2014-05-14_2316+0100_daily
   183 complete May 14 23:10     64807 8.1     22331 2014-05-14_2309+0100_daily
   118 complete May 13 23:26  32621854 8.1     22331 2014-05-13_2325+0100_daily
   203 complete May 13 23:21      5396 8.1     22331 2014-05-13_2320+0100_daily
   154 complete May 13 23:14     20931 8.1     22331 2014-05-13_2313+0100_daily
   134 complete May 13 23:08     51678 8.1     22331 2014-05-13_2307+0100_daily
   101 deleting May 13 01:43  33375746 134.0 (319143/319143 remaining)
    67 deleting May 13 01:41      3210 134.0 (319143/319143 remaining)
   245 deleting May 13 01:37         0 67.0 (143401/319143 remaining)
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 21:45 To: Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
I suppose it depends, but if I had to throw out a number, 10 minutes would probably be a good start in most situations.
--Jordan
From: Will.Burchell@skanska.co.uk Sent: Monday, June 16, 2014 4:38 PM To: Jordan Slingerland; marcus.nilsson@atea.se; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Thanks
Happy to run wafltop and dump the output for comment. Just the standard volume and process?
How long should I leave it running to get useful output?
William
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 21:24 To: Burchell, Will (ITSD); marcus.nilsson@atea.se; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
I am not actually sure that is an issue, but I was told it is by IBM N series support. I currently have an open case, escalated from IBM to NetApp, regarding the same issue with stale metadata. IBM told me to run sis start -s once, and then run sis start manually twice on each volume. (They specifically said that sis start on the volume, with no switches, had to be run twice on each volume and that a scheduled run would not suffice.) I still have over 100% stale on several volumes.
Maybe it was just a way to keep me busy for a week. "Um, yeah, go dedupe all your volumes 3 times and come back if it doesn't help."
More info here, but it sounds like you already got that. https://library.netapp.com/ecmdocs/ECMP1368838/html/GUID-5B6B2A2E-FAFD-4A92-...
Perhaps post your wafltop too and someone might be able to point something out in that.
--Jordan
From: Will.Burchell@skanska.co.uk Sent: Monday, June 16, 2014 3:56 PM To: Jordan Slingerland; marcus.nilsson@atea.se; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
No deduplications are running.
I have run sis status -l and can confirm the stale fingerprint figures for our SIS jobs, listed below. This looks pretty bad. What are we doing wrong here?
We upgraded through various versions of ONTAP last year and believe we ran into SIS issues, but thought they had been cleared by running the sis start -S command to clean them out.
We are on 8.1.3P2 and came from 8.0.x through many versions of 8.1.x over the last 18 months.
Will
8% 1% 79% 14% 0% 52% 130% 108% 126% 117% 181% 112% 121% 81% 7% 0% 0% 0% 0% 25% 0% 26%
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 20:47 To: Marcus Nilsson; Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
It sounds like you have ruled out the obvious, but I will ask anyway: no deduplications are running, right?
And the not so obvious: if none are running, look at sis status -l and check whether any of the volumes are over 20% in the Stale Fingerprints: column.
--Jordan
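As a rough illustration of the 20% check suggested above, the sketch below applies that threshold to the stale fingerprint percentages Will posted in this thread. The volume names were not included in the post, so entries are identified only by their position in the list; the function name and threshold constant are illustrative choices, not part of any NetApp tool.

```python
# Stale fingerprint percentages as posted in this thread.
# Volume names were not given, so each value is identified by list position only.
stale_pct = [8, 1, 79, 14, 0, 52, 130, 108, 126, 117, 181,
             112, 121, 81, 7, 0, 0, 0, 0, 25, 0, 26]

# Threshold suggested in the thread for the "Stale Fingerprints:" column.
THRESHOLD = 20


def over_threshold(values, threshold=THRESHOLD):
    """Return (position, percentage) pairs whose percentage exceeds the threshold."""
    return [(i, v) for i, v in enumerate(values) if v > threshold]


flagged = over_threshold(stale_pct)
print(f"{len(flagged)} of {len(stale_pct)} entries exceed {THRESHOLD}% stale fingerprints")
for position, pct in flagged:
    print(f"  entry {position}: {pct}%")
```

Whether a value above the threshold actually explains the CPU load is a separate question; this only shows which entries would be worth a closer look.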
From: Marcus Nilsson [mailto:marcus.nilsson@atea.se] Sent: Monday, June 16, 2014 3:44 PM To: Will.Burchell@skanska.co.uk; Jordan Slingerland; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Hi, it might be worth checking out the article at this link: http://www.jk-47.com/2014/02/attack-of-old-bugs-netapp-high-cpu/
We ran into this exact issue after upgrading a system from 8.0.3P2 to 8.1.4P1: a process looping in wafl scan blk_reclaim.
BR Marcus
From: toasters-bounces@teaparty.net On Behalf Of Will.Burchell@skanska.co.uk Sent: 16 June 2014 21:19 To: Jordan.Slingerland@independenthealth.com; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
I am checking all CPUs and they are pretty busy.
We are in the UK, so it's out of hours (and our nightly processes are mostly stopped right now).
We have the issue I mentioned, where our Exchange LUNs are on the same aggregate together, and we have a high IO workload with 6000 mailboxes.
This is the sysstat -M 1 output right now as an example. It seems high considering there is no de-dupe and only a single snapmirror running (to do a vol move for our Exchange separation problem).
Any other thoughts would be most welcome.
William
ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
92% 76% 56% 32% 68% 67% 73% 75% 59% 47% 0% 0% 22% 26% 0% 12% 105%( 65%) 0% 5% 0% 34% 9% 13% 9152 0%
98% 90% 80% 60% 85% 87% 89% 88% 75% 31% 0% 0% 26% 42% 0% 15% 122%( 72%) 21% 11% 0% 52% 9% 9% 5637 50%
99% 92% 80% 55% 84% 86% 90% 91% 69% 31% 0% 0% 29% 58% 0% 16% 103%( 66%) 3% 11% 0% 67% 9% 8% 5962 100%
98% 91% 79% 54% 84% 84% 88% 90% 72% 45% 0% 0% 27% 45% 0% 9% 122%( 78%) 0% 12% 0% 54% 8% 10% 7222 100%
99% 94% 88% 67% 89% 91% 94% 94% 77% 25% 0% 0% 28% 63% 0% 21% 100%( 65%) 25% 9% 0% 71% 8% 7% 4452 100%
97% 91% 79% 52% 83% 84% 89% 91% 67% 39% 0% 0% 30% 51% 0% 8% 113%( 73%) 0% 10% 0% 62% 8% 9% 8253 100%
98% 87% 71% 44% 78% 79% 84% 83% 67% 46% 0% 0% 25% 33% 0% 14% 121%( 74%) 0% 9% 0% 42% 12% 11% 9237 66%
97% 93% 86% 65% 88% 87% 92% 93% 80% 29% 0% 0% 27% 50% 0% 22% 116%( 69%) 24% 9% 0% 59% 9% 8% 5213 63%
97% 85% 69% 42% 76% 77% 83% 85% 60% 37% 0% 0% 28% 42% 0% 9% 109%( 69%) 1% 10% 0% 48% 11% 10% 6795 100%
98% 91% 77% 50% 82% 83% 88% 91% 66% 39% 0% 0% 30% 50% 0% 8% 116%( 73%) 0% 10% 0% 58% 7% 10% 6993 100%
98% 92% 82% 62% 86% 85% 90% 91% 78% 29% 0% 0% 28% 51% 0% 20% 108%( 69%) 21% 14% 0% 59% 6% 8% 5308 90%
100% 97% 91% 65% 90% 92% 94% 96% 80% 30% 0% 0% 30% 59% 0% 16% 120%( 76%) 3% 20% 0% 68% 9% 7% 5593 100%
98% 85% 70% 47% 78% 76% 82% 81% 71% 33% 0% 0% 26% 41% 0% 16% 110%( 70%) 4% 12% 0% 48% 10% 10% 5907 79%
100% 98% 89% 61% 89% 91% 94% 96% 75% 28% 0% 0% 32% 62% 0% 20% 98%( 65%) 17% 10% 0% 73% 8% 7% 5290 100%
98% 91% 77% 50% 82% 80% 85% 89% 72% 33% 0% 0% 30% 48% 0% 21% 108%( 64%) 0% 12% 0% 59% 6% 9% 6047 100%
99% 91% 75% 49% 82% 80% 84% 85% 77% 36% 0% 0% 26% 29% 0% 15% 144%( 80%) 1% 12% 0% 44% 10% 10% 6412 67%
100% 95% 88% 68% 90% 88% 94% 97% 80% 26% 0% 0% 29% 59% 0% 26% 100%( 66%) 23% 16% 0% 65% 8% 7% 4602 100%
98% 87% 74% 48% 79% 78% 86% 90% 63% 30% 0% 0% 29% 52% 0% 9% 105%( 68%) 0% 14% 0% 60% 9% 8% 5533 100%
98% 88% 77% 58% 83% 81% 87% 90% 73% 30% 0% 0% 27% 47% 0% 19% 106%( 66%) 21% 10% 0% 54% 9% 8% 5691 98%
ANY1+ ANY2+ ANY3+ ANY4+ AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host Ops/s CP
97% 86% 70% 43% 77% 77% 84% 87% 61% 39% 0% 0% 27% 39% 0% 7% 116%( 73%) 0% 11% 0% 49% 11% 10% 7526 100%
97% 86% 70% 44% 78% 80% 85% 87% 61% 34% 0% 0% 28% 44% 0% 9% 108%( 68%) 0% 14% 0% 53% 11% 13% 6308 100%
98% 87% 77% 59% 82% 80% 86% 88% 76% 28% 0% 0% 24% 44% 0% 23% 106%( 66%) 21% 14% 0% 53% 9% 8% 5200 82%
100% 96% 86% 57% 87% 88% 92% 95% 73% 30% 0% 0% 30% 56% 0% 18% 111%( 69%) 3% 18% 0% 68% 6% 8% 5163 100%
98% 90% 78% 55% 83% 82% 88% 91% 69% 32% 0% 0% 28% 44% 0% 11% 119%( 74%) 6% 19% 0% 54% 9% 9% 6148 99%
100% 97% 89% 64% 89% 91% 93% 96% 75% 34% 0% 0% 30% 62% 0% 22% 99%( 65%) 17% 9% 0% 70% 6% 8% 6496 100%
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 20:14 To: Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Even if it is 10k ops after 5 minutes, that is only about 33 ops per second. I doubt 33 unaligned ops per second is your CPU issue.
Maybe you can fix that one top talker just to show support that it is not the issue? Depending on how critical that one system is, that may or may not be worth fighting with support over.
Now, on to the CPU issue. You are using "sysstat -m 1" to look at all CPUs, and not only the "ANY" CPU metric, right?
If you run, for example, "sysstat -x 1", you are looking at the percentage of time that ANY of your CPUs is busy. That metric seems nearly useless to me.
--Jordan
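The back-of-the-envelope estimate above (10k ops over 5 minutes is roughly 33 ops per second) can be sketched as a small helper for turning a counter reading taken some time after nfsstat -z into a per-second rate. The function name is made up for illustration, and the counter value has to be read off the nfsstat -d output by hand; this does not parse any real command output.

```python
def unaligned_ops_per_second(counter_delta: int, interval_seconds: float) -> float:
    """Convert a counter increase observed over an interval into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_delta / interval_seconds


# Figures from the thread: about 10,000 unaligned ops counted 5 minutes after the reset.
rate = unaligned_ops_per_second(10_000, 5 * 60)
print(f"{rate:.1f} unaligned ops/s")
```

Comparing that rate with the Ops/s column in the sysstat output gives a rough sense of how small a share of the total workload the unaligned ops represent.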
From: Will.Burchell@skanska.co.uk Sent: Monday, June 16, 2014 3:07 PM To: Jordan Slingerland; Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
Thanks
I reset the counters with the -z switch.
I then ran -d again 5 minutes later. Many of the counters are in the tens, so I am happy with this. However, one server is in the thousands already. This is a Windows 2000 server (don't ask, please!) which has a misaligned C drive, but I have used the "functionally aligned" datastore in VSC to get around this. I assume nfsstat -d won't understand that, hence the counters in the thousands.
William
From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com] Sent: 16 June 2014 19:57 To: Burchell, Will (ITSD); Toasters@teaparty.net Subject: RE: High CPU VM misalignment confusion
First off, make sure the values in nfsstat -d are actually incrementing significantly: run nfsstat -z to clear the counters, wait a while, and then look at nfsstat -d again.
You may find that you are only doing a handful of unaligned ops and not hundreds or thousands per second.
--Jordan
From: toasters-bounces@teaparty.net On Behalf Of Will.Burchell@skanska.co.uk Sent: Monday, June 16, 2014 2:50 PM To: Toasters@teaparty.net Subject: High CPU VM misalignment confusion
Hello. I am hoping you can guide me in the right direction.
We have been experiencing very high CPU load on a 7-mode HA pair of 3270 controllers running 8.1.3P2.
We have worked with NetApp support on these issues and they note our workload is very high on one controller (where we run our VMware setup from).
We also have so-called "bad practice", where we are running our Exchange iSCSI LUNs on SATA with logs and DBs on the same aggregate (currently separating this out as I type).
I have been told by support that we have VMDK misalignment. However, I spent a long time a few months ago resolving this, first by using the VSC tool to confirm the problem and then by fixing it with a combination of MBRALIGN and VMware Converter as a V2V process.
The support guy tells me he sees misalignment when he runs nfsstat -d, but MBRSCAN shows these are aligned. What is going on here?
Trying to reduce our CPU and IO burden but getting conflicting information.
Finally I think we should look to upgrade to 8.1.4P2 to remove some bugs? We would consider 8.2.x but I don't think we can as we run Exchange 2010 (using SME 6.x etc)
Thanks in advance
William
Marcus, thanks for your reply. It has not reached me via email yet.
Your blog article is interesting reading.
I guess you picked up on the WAFL_Ex(Kahu) CPU figure being so high.
I wonder if it's the SIS stale fingerprints mentioned by Jordan, but I will check your link and our setup.
Thank you
The support guy tells me he sees misalignment when he runs nfsstat -d but MBRSCAN shows these are aligned. What is going on here? ---
Easy: support isn't PS.
He's using the output of a tool to provide knowledge, without understanding what the tool is measuring.
MBRSCAN is correct.
nfsstat is observing a signature, without the context of what's creating it, and that would be DB log writes.
"You've got to learn WHY things work on a Starship." -Captain Kirk
Toasters mailing list Toasters@teaparty.net http://www.teaparty.net/mailman/listinfo/toasters
I have to agree with Jeff M here. The histogram shown in nfsstat might show unaligned IO, but that doesn't mean misaligned IO. You should expect significant false positives with this tool when any kind of database is involved. The reason is the logging, which produces partial writes, but they virtually never cause problems because it's a sequential overwrite of a file. The partial writes only exist for a split-second before the next write fills out the block.
If MBRscan shows it's aligned, then everything should be just fine.
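The distinction drawn above between a misaligned partition and partial (unaligned) writes on an aligned partition can be sketched as follows. This is only an illustration, assuming the 4 KiB WAFL block size and 512-byte sectors; the function names and the 512-byte log write size are chosen for the example and are not taken from nfsstat or MBRSCAN.

```python
BLOCK = 4096   # WAFL block size in bytes
SECTOR = 512   # assumed guest disk sector size in bytes


def partition_is_aligned(start_sector: int) -> bool:
    """A partition is aligned when its first byte falls on a 4 KiB block boundary."""
    return (start_sector * SECTOR) % BLOCK == 0


def write_is_partial(offset: int, length: int) -> bool:
    """A write is partial when it does not start and end on 4 KiB block boundaries."""
    return offset % BLOCK != 0 or (offset + length) % BLOCK != 0


# Misalignment: a legacy MBR layout starting at sector 63 shifts every block.
print("start sector 63 aligned:  ", partition_is_aligned(63))
print("start sector 2048 aligned:", partition_is_aligned(2048))

# Partial writes: eight sequential 512-byte log writes on an aligned partition.
# Each one counts as a partial write, yet together they fill one block exactly.
log_writes = [(i * 512, 512) for i in range(8)]
partial_count = sum(write_is_partial(off, length) for off, length in log_writes)
bytes_covered = sum(length for _, length in log_writes)
print(f"{partial_count} partial writes covering {bytes_covered} bytes")
```

A histogram that only counts writes by their offset within a block cannot tell these two cases apart, which is consistent with the false positives described above.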
From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Jeff Mohler Sent: Monday, June 16, 2014 9:00 PM To: Will.Burchell@skanska.co.uk Cc: toasters@teaparty.net Subject: Re: High CPU VM misalignment confusion
-- --- Gustatus Similis Pullus