On Thu, Dec 3, 2015 at 4:33 PM, John Stoffel <john(a)stoffel.org> wrote:
> >>>>> "Jr" == Jr Gardner <phil.gardnerjr(a)gmail.com> writes:
>
> So does that mean you're indexing 300GB worth of data, or generating 300GB
> worth of Solr indexes? How big are the generated indexes? The reason
> I ask is that if they really are that big, you may be writing 1.2TB of
> indexes, which is all the same data...
>
> So if you could maybe split the indexers into two pools, and have only
> one system in each pair doing the indexing... that might be a big
> savings.
>
Not all of the index files get written at the same time. That would only
happen for a new/fresh slave with no existing index. The index is split
into many files, and only the new ones that the master creates are pulled
down by the slaves and written to disk. We are talking about 1-2GB for an
update, but only if there are updates to the master. There may not be
updates for every replication check interval.
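If it helps picture it, the slaves do this over HTTP via Solr's replication
handler, so you can watch exactly what a slave pulled on its last poll.
Something like the following (the host and core names here are just
placeholders for ours):

curl 'http://solr-slave01:8983/solr/inventory/replication?command=details&wt=json'
    # reports the slave's index version/generation vs. the master's and its
    # recent replication activity
curl 'http://solr-slave01:8983/solr/inventory/replication?command=fetchindex'
    # forces a pull right away instead of waiting for the next poll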
> How was the performance on the handful of DL360s? And how is the
> performance of the slaves compared to the old setup? Even if you're
> beating the Netapp to death for a bit, maybe it's a net win?
>
Performance is "decent". The DL360s write about as fast as the 8040 can go
before the back-to-back (B) CPs start slowing things down, roughly
200-300MB/s.
> So this I think answers my question, the master trawls the dataset
> building the index, then the slaves copy those new indexes to their
> storage area.
>
> And it really is too bad that you can't just use the Netapp for the
> replication with either a snapshot or a snapmirror to replicate the
> new files to the six slaves. If they're read only, you should be
> working hard to keep them from reading the same file six times from
> the master and then writing six copies back to the Netapp.
>
> Now hopefully the clients aren't re-writing all 300GB each time, but
> the write numbers you show are simply huge! You're seeing 10x the
> writes compared to reads, which implies that these slaves aren't set up
> right. They should be read-mostly!
>
> Does the index really need to be updated every three minutes? That's
> a pretty darn short time.
>
> And is there other load on the 8040 cluster as well?
>
Yeah, it looks like this is something we are going to have to redesign
before we go to the virtualized instances. I agree there are a few different
ways to do it, but the snapmirror option seems like a good one. I also
agree that it's inefficient to have the same copy of the data written
numerous times.
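If we do go that route, I'm picturing roughly one mirror per slave, something
like the following (the vserver/volume names, size, and schedule below are
just placeholders, and each slave would mount its own read-only destination):

filer::> volume create -vserver svm_solr -volume solr_index_s1 -aggregate aggr_sas900 -size 500g -type DP
filer::> snapmirror create -source-path svm_solr:solr_index_master -destination-path svm_solr:solr_index_s1 -type DP -schedule 5min
filer::> snapmirror initialize -destination-path svm_solr:solr_index_s1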
These are really write-heavy at the moment because they are not in
production yet. We only recently got these new VMs set up and going, and I
was watching performance take a hit across this controller and wanted to
investigate.
The index does need to be updated relatively frequently. We use this
particular index to store vehicle/inventory data for the frontend web site
to query, so when a client makes changes on the backend, they want the
changes to take effect relatively quickly; otherwise it's a "bad experience"
:)
> What's the load on the Netapp when no nodes are writing at all? Are
> you getting hit by lots of writes then? If so... you need more
> spindles. And how full is the aggregate? And how busy/full are the
> other volumes?
>
Here is a snapshot of sysstat when those Solr slaves are not writing to
disk:
 CPU  Total    Net kB/s       Disk kB/s      Tape kB/s  Cache Cache    CP  CP  Disk
      ops/s      in     out    read   write  read  write  age   hit  time  ty  util
  3%    705   18201    2460      26       0     0      0   3s   97%    0%   -    1%
 14%   1128   36607   12876   29246   59480     0      0   3s   99%   62%   T   10%
  6%    968   33432    9258   15874   98826     0      0   3s   97%  100%   :    8%
  6%    619   29781   20739    5605   95763     0      0   3s   99%  100%   :    9%
  4%   1055   43136   15750      84   18108     0      0   3s   97%   25%   :    3%
 13%   1041   38311    5779      52      12     0      0   3s   98%    0%   -    1%
  4%   1089   33113    7183      44       0     0      0   3s   99%    0%   -    1%
  4%   1277   43362   14837      86      16     0      0   3s   98%    0%   -    1%
  5%   1843   48354   24844      26      12     0      0   3s  100%    0%   -    1%
 16%   1849   57845   21218   29590   65774     0      0   3s   99%   74%   T   10%
  6%    772   39316   24262   17096   97466     0      0   3s   97%  100%   :    8%
  7%   1019   57397   32988   19028   86126     0      0   3s   99%  100%   :    8%
 13%    843   29941    9331     882   37964     0      0   3s   96%   62%   :    2%
  3%    759   31928   12799      40      12     0      0   3s   97%    0%   -    2%
  4%   1216   56116   26869      88      16     0      0   3s   98%    0%   -    1%
  3%    904   49644   25957      38       0     0      0   3s   97%    0%   -    1%
  5%   1300   60471   36002      96      12     0      0   3s   94%    0%   -    1%
 16%   2127   45971    6880   23822   29086     0      0   3s   99%   36%   T   10%
Pretty quiet. The timer CP is much nicer to see...
Aggregate is not full at all:
filer::> df -A -h -x aggr_sas900
Aggregate total used avail capacity
aggr_sas900 58TB 30TB 28TB 52%
And this particular volume is only at 65% capacity. Other volumes aren't
over 80% either.
--
GPG keyID: 0xFECC890C
Phil Gardner