Marcus, thanks for your reply. I haven’t received it via email here yet.

 

Reading your blog article is interesting

 

I guess you picked up on the [WAFL_Ex(Kahu)] CPU figure being so high.

 

I wonder if it’s the SIS stale fingerprints mentioned by Jordan, but I will check your link and our setup.

 

Thank you

 

From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com]
Sent: 16 June 2014 20:47
To: Marcus Nilsson; Burchell, Will (ITSD); Toasters@teaparty.net
Subject: RE: High CPU VM misalignment confusion

 

It sounds like you ruled out the obvious, but I will say it anyway: there are no deduplications running, right?

 

And the not so obvious: if none are running, look at sis status -l and check whether any of the volumes are over 20% in the Stale Fingerprints: column.
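As a rough illustration of that check, here is a minimal Python sketch that scans saved "sis status -l" text and flags volumes above the 20% mark. The sample text and the "Path:" / "Stale Fingerprints:" line layout are assumptions made up for the example; real output has more fields and may be laid out differently, so treat the parsing as a starting point only.

```python
import re

# Invented sample for illustration; not captured from a real filer.
SAMPLE = """\
Path:                    /vol/vol_a
State:                   Enabled
Stale Fingerprints:      4%
Path:                    /vol/vol_b
State:                   Enabled
Stale Fingerprints:      27%
"""


def stale_volumes(text, threshold=20):
    """Return (path, percent) pairs whose stale fingerprint percentage exceeds threshold."""
    flagged = []
    path = None
    for line in text.splitlines():
        # Split each "Key: value" line on the first colon.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Path":
            path = value
        elif key == "Stale Fingerprints" and path is not None:
            match = re.match(r"(\d+)%", value)
            if match and int(match.group(1)) > threshold:
                flagged.append((path, int(match.group(1))))
    return flagged


print(stale_volumes(SAMPLE))
```

With the invented sample above, only /vol/vol_b is reported.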

 

 

--Jordan

 

From: Marcus Nilsson [mailto:marcus.nilsson@atea.se]
Sent: Monday, June 16, 2014 3:44 PM
To: Will.Burchell@skanska.co.uk; Jordan Slingerland; Toasters@teaparty.net
Subject: RE: High CPU VM misalignment confusion

 

Hi,

Might be worth checking out the article at this link http://www.jk-47.com/2014/02/attack-of-old-bugs-netapp-high-cpu/

 

We ran into this exact issue after upgrading a system from 8.0.3P2 to 8.1.4P1. A process looping in wafl scan blk_reclaim.

 

BR Marcus

 

 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Will.Burchell@skanska.co.uk
Sent: den 16 juni 2014 21:19
To: Jordan.Slingerland@independenthealth.com; Toasters@teaparty.net
Subject: RE: High CPU VM misalignment confusion

 

I am checking all cpus and they are pretty busy

 

We are in the UK so it’s out of hours (and our nightly processes are mostly stopped right now).

 

We have the issue I mentioned where our exchange LUNs are on the same aggregate together and we have a high IO workload with 6000 mailboxes.

 

This is the sysstat -M 1 output right now as an example. It seems high considering there is no de-dupe and only a single SnapMirror running (to do a vol move for our Exchange separation problem).

 

I would be most interested in any other thoughts.

 

William

 

ANY1+ ANY2+ ANY3+ ANY4+  AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host  Ops/s   CP
  92%   76%   56%   32%  68%  67%  73%  75%  59%     47%       0%      0%     22%  26%     0%    12%    105%( 65%)          0%        5%   0%    34%   9%  13%   9152   0%
  98%   90%   80%   60%  85%  87%  89%  88%  75%     31%       0%      0%     26%  42%     0%    15%    122%( 72%)         21%       11%   0%    52%   9%   9%   5637  50%
  99%   92%   80%   55%  84%  86%  90%  91%  69%     31%       0%      0%     29%  58%     0%    16%    103%( 66%)          3%       11%   0%    67%   9%   8%   5962 100%
  98%   91%   79%   54%  84%  84%  88%  90%  72%     45%       0%      0%     27%  45%     0%     9%    122%( 78%)          0%       12%   0%    54%   8%  10%   7222 100%
  99%   94%   88%   67%  89%  91%  94%  94%  77%     25%       0%      0%     28%  63%     0%    21%    100%( 65%)         25%        9%   0%    71%   8%   7%   4452 100%
  97%   91%   79%   52%  83%  84%  89%  91%  67%     39%       0%      0%     30%  51%     0%     8%    113%( 73%)          0%       10%   0%    62%   8%   9%   8253 100%
  98%   87%   71%   44%  78%  79%  84%  83%  67%     46%       0%      0%     25%  33%     0%    14%    121%( 74%)          0%        9%   0%    42%  12%  11%   9237  66%
  97%   93%   86%   65%  88%  87%  92%  93%  80%     29%       0%      0%     27%  50%     0%    22%    116%( 69%)         24%        9%   0%    59%   9%   8%   5213  63%
  97%   85%   69%   42%  76%  77%  83%  85%  60%     37%       0%      0%     28%  42%     0%     9%    109%( 69%)          1%       10%   0%    48%  11%  10%   6795 100%
  98%   91%   77%   50%  82%  83%  88%  91%  66%     39%       0%      0%     30%  50%     0%     8%    116%( 73%)          0%       10%   0%    58%   7%  10%   6993 100%
  98%   92%   82%   62%  86%  85%  90%  91%  78%     29%       0%      0%     28%  51%     0%    20%    108%( 69%)         21%       14%   0%    59%   6%   8%   5308  90%
 100%   97%   91%   65%  90%  92%  94%  96%  80%     30%       0%      0%     30%  59%     0%    16%    120%( 76%)          3%       20%   0%    68%   9%   7%   5593 100%
  98%   85%   70%   47%  78%  76%  82%  81%  71%     33%       0%      0%     26%  41%     0%    16%    110%( 70%)          4%       12%   0%    48%  10%  10%   5907  79%
 100%   98%   89%   61%  89%  91%  94%  96%  75%     28%       0%      0%     32%  62%     0%    20%     98%( 65%)         17%       10%   0%    73%   8%   7%   5290 100%
  98%   91%   77%   50%  82%  80%  85%  89%  72%     33%       0%      0%     30%  48%     0%    21%    108%( 64%)          0%       12%   0%    59%   6%   9%   6047 100%
  99%   91%   75%   49%  82%  80%  84%  85%  77%     36%       0%      0%     26%  29%     0%    15%    144%( 80%)          1%       12%   0%    44%  10%  10%   6412  67%
 100%   95%   88%   68%  90%  88%  94%  97%  80%     26%       0%      0%     29%  59%     0%    26%    100%( 66%)         23%       16%   0%    65%   8%   7%   4602 100%
  98%   87%   74%   48%  79%  78%  86%  90%  63%     30%       0%      0%     29%  52%     0%     9%    105%( 68%)          0%       14%   0%    60%   9%   8%   5533 100%
  98%   88%   77%   58%  83%  81%  87%  90%  73%     30%       0%      0%     27%  47%     0%    19%    106%( 66%)         21%       10%   0%    54%   9%   8%   5691  98%
ANY1+ ANY2+ ANY3+ ANY4+  AVG CPU0 CPU1 CPU2 CPU3 Network Protocol Cluster Storage Raid Target Kahuna WAFL_Ex(Kahu) WAFL_XClean SM_Exempt Cifs Exempt Intr Host  Ops/s   CP
  97%   86%   70%   43%  77%  77%  84%  87%  61%     39%       0%      0%     27%  39%     0%     7%    116%( 73%)          0%       11%   0%    49%  11%  10%   7526 100%
  97%   86%   70%   44%  78%  80%  85%  87%  61%     34%       0%      0%     28%  44%     0%     9%    108%( 68%)          0%       14%   0%    53%  11%  13%   6308 100%
  98%   87%   77%   59%  82%  80%  86%  88%  76%     28%       0%      0%     24%  44%     0%    23%    106%( 66%)         21%       14%   0%    53%   9%   8%   5200  82%
 100%   96%   86%   57%  87%  88%  92%  95%  73%     30%       0%      0%     30%  56%     0%    18%    111%( 69%)          3%       18%   0%    68%   6%   8%   5163 100%
  98%   90%   78%   55%  83%  82%  88%  91%  69%     32%       0%      0%     28%  44%     0%    11%    119%( 74%)          6%       19%   0%    54%   9%   9%   6148  99%
 100%   97%   89%   64%  89%  91%  93%  96%  75%     34%       0%      0%     30%  62%     0%    22%     99%( 65%)         17%        9%   0%    70%   6%   8%   6496 100%

 

From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com]
Sent: 16 June 2014 20:14
To: Burchell, Will (ITSD); Toasters@teaparty.net
Subject: RE: High CPU VM misalignment confusion

 

Even if it is 10k ops after 5 minutes...that is only 33 ops per second.  I doubt 33 unaligned ops per second is your cpu issue.
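The arithmetic behind that estimate can be sketched in a few lines of Python. The function name is hypothetical, and the 10,000 and 5 minute figures are simply the round numbers used above, not measured values.

```python
def ops_per_second(counter_delta, interval_seconds):
    """Average rate implied by a counter that grew by counter_delta over the interval."""
    return counter_delta / interval_seconds


# 10k ops counted over a 5 minute window after clearing the counters.
rate = ops_per_second(10_000, 5 * 60)
print(f"{rate:.1f} unaligned ops/s")  # 33.3 unaligned ops/s
```

The same calculation applies to any counter that is cleared and then read again after a known interval.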

 

Maybe you can fix that one top talker, just to show support that it is not the issue? Depending on how critical that one system is, that may or may not be worth fighting with support over.

 

Now, on to the CPU issue. You are using “sysstat -m 1” to look at all CPUs and not only the “ANY” CPU metric, right?

 

If you do, for example, “sysstat -x 1”, you are looking at the % of time that ANY of your CPUs are busy. Seems to me this metric is nearly completely useless.
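To show why an “ANY” style figure can look alarming while the CPUs are mostly idle, here is a small Python sketch. It assumes the metric is the fraction of sampling intervals in which at least one CPU was busy, as described above; the samples are made up for illustration and the function names are hypothetical.

```python
# Each row is one sampling interval; each entry is 1 if that CPU was busy, else 0.
# Made-up samples: exactly one of four CPUs is busy in four of five intervals.
samples = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]


def any_busy_pct(samples):
    """Percentage of intervals in which at least one CPU was busy."""
    return 100 * sum(1 for row in samples if any(row)) / len(samples)


def avg_busy_pct(samples):
    """Average busy percentage across all CPUs and intervals."""
    total_busy = sum(sum(row) for row in samples)
    return 100 * total_busy / (len(samples) * len(samples[0]))


print(any_busy_pct(samples))  # 80.0
print(avg_busy_pct(samples))  # 20.0
```

Here the “ANY” figure is 80% even though the average load across the four CPUs is only 20%, which is why the per-CPU view is more informative.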

 

 

--Jordan

 

 

From: Will.Burchell@skanska.co.uk [mailto:Will.Burchell@skanska.co.uk]
Sent: Monday, June 16, 2014 3:07 PM
To: Jordan Slingerland; Toasters@teaparty.net
Subject: RE: High CPU VM misalignment confusion

 

Thanks

 

I reset with the -z switch.

 

I then ran nfsstat -d again 5 minutes later. Many of the counters are in the tens, so I am happy with this. However, one server is in the thousands already. This is a Windows 2000 server (don’t ask, please!) which has a misaligned C drive, but I have used the “functional aligned” datastore in VSC to get around this. I assume nfsstat -d won’t understand that, hence the counters in the thousands.

 

William

 

 

From: Jordan Slingerland [mailto:Jordan.Slingerland@independenthealth.com]
Sent: 16 June 2014 19:57
To: Burchell, Will (ITSD); Toasters@teaparty.net
Subject: RE: High CPU VM misalignment confusion

 

First off, make sure the values in nfsstat -d are actually incrementing significantly by running nfsstat -z to clear the counters, then waiting a while and looking at nfsstat -d again.

 

You may find that you are only doing a handful of unaligned ops and not hundreds or thousands per second.

 

 

--Jordan

 

 

From: toasters-bounces@teaparty.net [mailto:toasters-bounces@teaparty.net] On Behalf Of Will.Burchell@skanska.co.uk
Sent: Monday, June 16, 2014 2:50 PM
To: Toasters@teaparty.net
Subject: High CPU VM misalignment confusion

 

Hello. I am hoping you can guide me in the right direction

 

We have been experiencing very high CPU load on a 7-mode HA pair of 3270 controllers running 8.1.3P2.

 

We have worked with NetApp support on these issues and they note our workload is very high on one controller (where we run our VMware setup from).

 

We also have so-called “bad practice” where we are running our Exchange iSCSI LUNs on SATA with logs and DBs on the same aggregate (currently separating this out as I type).

 

I have been told by support that we have VMDK misalignment. However, I spent a long time a few months ago resolving this, first by using the VSC tool to confirm the problem and then by fixing it with a combination of MBRALIGN and VMware Converter as a V2V process.

 

The support guy tells me he sees misalignment when he runs nfsstat -d, but MBRSCAN shows these are aligned. What is going on here?

 

Trying to reduce our CPU and IO burden but getting conflicting information.

 

Finally, I think we should look to upgrade to 8.1.4P2 to remove some bugs? We would consider 8.2.x, but I don’t think we can as we run Exchange 2010 (using SME 6.x etc.).

 

Thanks in advance

 

William