Jr> I have an 8040 with 88 x 900G 10k disks, all assigned to a single
Jr> aggregate on one of the controllers. There are a few volumes on
Jr> here, all vSphere NFS datastores. This aggregate also has a slice
Jr> of flash pool assigned to it, currently about 900GB usable.
Do you have compression or dedupe turned on for these volumes? And how much space is your SOLR data taking?
Jr> We recently deployed some CentOS 6 VMs on these datastores that
Jr> are running Solr, which is an application used for distributed
Jr> indexing. The replication is done in a typical master/slave
Jr> relationship. My understanding of Solr's replication is that it
Jr> happens periodically: the slaves download any new index files that
Jr> exist on the master but not on the slaves into a temp location,
Jr> and then the slaves replace their existing index files with the
Jr> new index files from the master. So it appears to be a mostly
Jr> sequential write process.
That would imply to me that all the indexing of the documents happens on the master, and that these slaves are just querying the index. If they're copies, do you need to keep them in the data store at all? Would it be more effective to keep them purely local to the VMs, either on local datastores on the ESX host(s), or even in memory? Though in that case I'd argue that just having one big VM with enough memory to cache the read-only index would make more sense... but reliability would suffer, of course.
Jr> During the replication events, we are seeing the controller
Jr> hosting this particular datastore basically getting crushed and
Jr> issuing B and b CPs. Here is some output of sysstat during one of
Jr> the replication events:
Can the replication events be staggered, or does each SOLR slave wake up at the same time, copy the file and then write it to its own datastore, which I suspect is distinct for some of them?
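If staggering is an option, a rough sketch of one way to drive it externally is below. This is only a sketch assuming the stock Solr /replication handler on each slave (disablepoll and fetchindex are its standard commands); the host names, core name, port and interval are made-up placeholders, since I don't know your layout:

#!/usr/bin/env python3
# Rough sketch only: turn off each slave's own polling, then kick off
# one-shot index fetches a few minutes apart so only one slave is
# pulling from the master (and hammering the datastore) at a time.
# Host names, core name and interval below are made-up placeholders.
import time
import urllib.request

SLAVES = ["solr-slave-01", "solr-slave-02", "solr-slave-03"]
CORE = "collection1"
STAGGER_SECONDS = 300  # 5 minutes between slaves

def replication_cmd(host, command):
    # Solr's ReplicationHandler is driven with plain HTTP GETs.
    url = ("http://%s:8983/solr/%s/replication?command=%s&wt=json"
           % (host, CORE, command))
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

for host in SLAVES:
    # Stop the slaves polling on their own schedule; this script now
    # decides when each one pulls.
    replication_cmd(host, "disablepoll")

for host in SLAVES:
    # fetchindex returns immediately and the copy runs on the slave,
    # so the sleep is only crude spacing, not a completion check.
    replication_cmd(host, "fetchindex")
    time.sleep(STAGGER_SECONDS)

Depending on the Solr version, the replication handler may also have a bandwidth throttle on the master side (maxWriteMBPerSec, if I'm remembering the name right), which would cap how hard each pull can hit the backend.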
Jr>  CPU   Total      Net kB/s       Disk kB/s     Tape kB/s Cache Cache    CP  CP  Disk
Jr>        ops/s      in     out     read   write  read write  age   hit  time  ty  util
Jr>   7%     854   60795   38643   14108  107216     0     0    20   96%  100%   :    9%
Jr>   7%     991   61950   41930    6542   89350     0     0    20   95%  100%   :    9%
Jr>   4%     977   62900   38820    1244    2932     0     0    20   93%    9%   :    1%
Jr>   4%     811   52853   35658      76      12     0     0    20   96%    0%   -    1%
Jr>   5%     961   67428   43600      60      12     0     0    20   97%    0%   -    1%
Jr>   4%     875   57204   41222      66       4     0     0    20   97%    0%   -    1%
Jr>   5%    1211   78933   59481     110      12     0     0    20   97%    0%   -    1%
Jr>  16%    1024   55549   31785   33306   84626     0     0    20   97%   89%   T   14%
Jr>   7%    1164   56356   36122   14830  102808     0     0    20   96%  100%   :    8%
Jr>  49%   13991  909816   56134    3926   62136     0     0    24   82%  100%   B    7%
Jr>  78%   13154  842333   55302   53011  868408     0     0    24   83%  100%   :   51%
Jr>  83%   12758  818914   59706   44897  742156     0     0    23   89%   97%   F   45%
Jr>  84%   11997  765669   53760   64084  958309     0     0    26   89%  100%   B   59%
Jr>  80%   11823  725972   46004   73227  867704     0     0    26   88%  100%   B   51%
Jr>  83%   15125  957531   46144   42439  614295     0     0    23   87%  100%   B   36%
Jr>  74%    9584  612985   42404   67147  839408     0     0    24   93%  100%   B   48%
Jr>  78%   11367  751672   64071   49881  770340     0     0    24   88%  100%   B   46%
Jr>  79%   12468  822736   53757   38995  595721     0     0    24   87%  100%   #   34%
Jr>  56%    6315  396022   48623   42597  601630     0     0    24   94%  100%   B   35%
Jr>  67%    7923  554797   56459   26309  715759     0     0    24   87%  100%   #   43%
Jr>  69%   13719  879990   37401   41532  333768     0     0    22   87%  100%   B   22%
Jr>  45%      24   52946   42826   33186  736345     0     0    22   98%  100%   #   41%
Jr>  72%   13909  888007   46266   29109  485422     0     0    22   87%  100%   B   28%
Jr>  59%    8036  523206   53199   41719  646767     0     0    22   90%  100%   B   37%
Jr>  68%    7336  505544   63590   46602  870744     0     0    22   91%  100%   B   49%
Jr>  71%   12673  809175   29070   21208  556669     0     0     6   89%  100%   #   38%
Jr>  70%   12097  726574   49381   36251  588939     0     0    24   90%  100%   B   35%
Jr> And here is some iostat output from one of the Solr slaves during the same timeframe:
Your write numbers are simply huge! No wonder it's getting crushed. Maybe your SOLR setup can be changed so the six VMs share a data store and only one of them actually updates it, while the others are just read-only slaves doing lookups?
And do you have a flashcache? Maybe setting up a de-duped dedicated datastore for these SOLR clients would be the way to go here. How big is this data set?
Jr> 12/03/2015 06:48:36 PM
Jr> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
Jr>            7.54    0.00    7.42   44.12    0.00   40.92
Jr>
Jr> Device: rrqm/s   wrqm/s  r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
Jr> sda       0.00     4.50 0.00     0.00    0.00    0.00     0.00     5.46    0.00   0.00  62.65
Jr> sdb       0.00 26670.00 0.00   190.50    0.00   95.25  1024.00   162.75  214.87   5.25 100.00
Jr> dm-0      0.00     0.00 1.00    11.50    0.00    0.04     8.00     5.59    0.00  50.12  62.65
Jr> dm-1      0.00     0.00 0.00     0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
Jr> dm-2      0.00     0.00 0.00     3.00    0.00    0.01     8.00     2.44    0.00 135.33  40.60
Jr> dm-3      0.00     0.00 0.00 26880.00    0.00  105.00     8.00 20828.90  194.77   0.04 100.00
Jr>
Jr> 12/03/2015 06:48:38 PM
Jr> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
Jr>            9.23    0.00   16.03   24.23    0.00   50.51
Jr>
Jr> Device: rrqm/s   wrqm/s  r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
Jr> sda       0.00   177.00 1.00    19.50    0.00    0.79    78.83     7.91  651.90  16.59  34.00
Jr> sdb       0.00 73729.00 0.00   599.50    0.00  299.52  1023.23   142.51  389.81   1.67 100.00
Jr> dm-0      0.00     0.00 0.00     0.00    0.00    0.00     0.00     4.56    0.00   0.00  27.55
Jr> dm-1      0.00     0.00 0.00     0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
Jr> dm-2      0.00     0.00 0.00   186.50    0.00    0.73     8.00    87.75  483.59   1.82  34.00
Jr> dm-3      0.00     0.00 0.00 74310.00    0.00  290.27     8.00 18224.54  402.32   0.01 100.00
Jr>
Jr> 12/03/2015 06:48:40 PM
Jr> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
Jr>            9.27    0.00   10.04   22.91    0.00   57.79
Jr>
Jr> Device: rrqm/s   wrqm/s  r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
Jr> sda       0.00     0.00 0.00     0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
Jr> sdb       0.00 24955.50 0.00   202.00    0.00  101.00  1024.00   142.07  866.56   4.95 100.05
Jr> dm-0      0.00     0.00 0.00     0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
Jr> dm-1      0.00     0.00 0.00     0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
Jr> dm-2      0.00     0.00 0.00     0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
Jr> dm-3      0.00     0.00 0.00 25151.50    0.00   98.25     8.00 18181.29  890.67   0.04 100.05
Jr>
Jr> 12/03/2015 06:48:42 PM
Jr> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
Jr>            9.09    0.00   12.08   21.95    0.00   56.88
Jr>
Jr> Device: rrqm/s   wrqm/s  r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
Jr> sda       0.00     2.50 0.00     1.50    0.00    0.01    18.67     0.46   36.33 295.33  44.30
Jr> sdb       0.00 59880.50 0.00   461.50    0.00  230.75  1024.00   144.82  173.12   2.17  99.95
Jr> dm-0      0.00     0.00 0.00     1.00    0.00    0.00     8.00     0.81    0.00 407.50  40.75
Jr> dm-1      0.00     0.00 0.00     0.00    0.00    0.00     0.00     0.00    0.00   0.00   0.00
Jr> dm-2      0.00     0.00 0.00     3.50    0.00    0.01     8.00     0.13   37.29  10.14   3.55
Jr> dm-3      0.00     0.00 0.00 60352.50    0.00  235.75     8.00 18538.70  169.30   0.02 100.00
Jr> As you can see, we are getting some decent throughput, but it
Jr> causes the latency to spike on the filer. I have heard that the
Jr> avgrq-sz in iostat is related to the block size, can anyone verify
Jr> that? Is a 1MB block size too much for the filer? I am still
Jr> researching if there is a way to modify this in Solr, but I
Jr> haven't come up with much yet. Note, the old Solr slaves were made
Jr> up of physical DL360p's with only a local 2-disk 10k RAID1. The
Jr> new slaves and relay-master are currently all connected with 10Gb,
Jr> which removed the 1Gb network bottleneck for the replication and
Jr> could be uncorking the bottle, so to speak. I'm still at a loss
Jr> why this is hurting the filer so much though.
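On the avgrq-sz question: iostat reports it in 512-byte sectors, not bytes, so the ~1024 on sdb is about a 512KB average request, not 1MB, and the 8 on dm-3 is the usual 4KB once the device-mapper layer has split things up. Quick unit-conversion check, nothing Solr-specific:

SECTOR_BYTES = 512      # avgrq-sz is in 512-byte sectors (see the iostat man page)

sdb_avgrq_sz = 1024.0   # from your sdb lines above
dm3_avgrq_sz = 8.0      # from your dm-3 lines above

print(sdb_avgrq_sz * SECTOR_BYTES / 1024.0)  # 512.0 -> ~512KB per request
print(dm3_avgrq_sz * SECTOR_BYTES / 1024.0)  # 4.0   -> 4KB per request

So the request size itself isn't outrageous; what seems to be hurting the filer is the sustained several-hundred-MB/s of writes (presumably several slaves replicating at once) landing on one aggregate, which lines up with the back-to-back CPs in your sysstat.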
Jr> Any ideas?
Jr> --
Jr> GPG keyID: 0xFECC890C
Jr> Phil Gardner