Sphinx RT-index updating in parallel

Submitted by 爱⌒轻易说出口 on 2020-01-16 14:16:06

Question


Since I can't find it anywhere, is it possible to update the RT-index of Sphinx in parallel?

For instance, I have noticed a drop in processing speed once documents exceed 1,000,000 words. I would therefore like to have my processor handle documents with over 1,000,000 words in a separate thread, so they don't hold back the smaller documents.

However, I haven't been able to find any benchmarks of updating the RT-index in parallel, nor any documentation on it.

Are there others who are using this approach or is it considered bad practice?


Answer 1:


First of all, let me remind you that when you update something in a Sphinx real-time index (the same applies to Manticore Search, Lucene, Solr, and Elasticsearch), you don't actually update anything in place: you just add the change to a new segment (a RAM chunk in the case of Sphinx), which will eventually (often much later) be merged with other segments, at which point the change is really applied. So the question is really how fast you can populate the RT RAM chunk with new records and how concurrency changes the throughput. I made a test based on https://github.com/Ivinco/stress-tester and here's what I got:

snikolaev@dev:~/stress_tester_github$ for conc in 1 2 5 8 11; do ./test.php --plugin=rt_insert.php -b=100 --data=/home/snikolaev/hacker_news_comments.smaller.csv -c=$conc --limit=100000 --csv; done;
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
1;100;28.258;3537;100000;99957;0.275;0.202;0.519;1.221
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
2;100;18.811;5313;100000;99957;0.34;0.227;0.673;2.038
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
5;100;16.751;5967;100000;99957;0.538;0.326;1.163;3.797
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
8;100;20.576;4857;100000;99957;0.739;0.483;1.679;5.527
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
11;100;23.55;4244;100000;99957;0.862;0.54;2.102;5.849

I.e. increasing concurrency from 1 to 11 (in my case, on an 8-core server) raises the throughput from about 3500 to about 4200 documents per second, roughly 20%. Not bad, but not that great a performance boost (note that in this run throughput actually peaked at almost 6000 docs/sec at concurrency 5 and degraded beyond that).
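For reference, here is a minimal sketch (not the stress-tester's actual code) of what such a concurrent batched-insert driver can look like in Python. The index name `rt`, the single `content` field, the port 9306, and the use of the pymysql client are all assumptions for illustration; Sphinx's searchd speaks the MySQL wire protocol, so any MySQL client works:

```python
import threading


def batch_insert_sql(index, docs, start_id=1):
    """Build one multi-row SphinxQL INSERT for a batch of documents.

    Batching (the -b=100 flag in the test above) matters: it costs one
    round trip per batch instead of one per document.
    """
    values = ", ".join(
        "(%d, '%s')" % (start_id + i, doc.replace("'", "\\'"))
        for i, doc in enumerate(docs)
    )
    return "INSERT INTO %s (id, content) VALUES %s" % (index, values)


def insert_worker(batch_sqls, host="127.0.0.1", port=9306):
    """Each worker holds its own connection and pushes its batches."""
    import pymysql  # assumption: pymysql is installed

    conn = pymysql.connect(host=host, port=port)
    with conn.cursor() as cur:
        for sql in batch_sqls:
            cur.execute(sql)
    conn.close()


def run_concurrent(batch_lists):
    """Start one writer thread per batch list (the -c concurrency knob)."""
    threads = [threading.Thread(target=insert_worker, args=(b,))
               for b in batch_lists]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The point of the sketch is only to show where the two knobs from the benchmark live: batch size in `batch_insert_sql` and concurrency in `run_concurrent`.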

In your case another approach may work better: instead of writing to a single index, write to multiple indexes and put a distributed index on top to combine them for searching, so-called sharding. For example, if you write to two RT indexes instead of one, you can get this:

snikolaev@dev:~/stress_tester_github$ for conc in 1 2 5 8 11; do ./test.php --plugin=rt_insert.php -b=100 --data=/home/snikolaev/hacker_news_comments.smaller.csv -c=$conc --limit=100000 --csv; done;
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
1;100;28.083;3559;100000;99957;0.274;0.206;0.514;1.223
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
2;100;18.03;5543;100000;99957;0.328;0.225;0.653;1.919
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
5;100;15.07;6633;100000;99957;0.475;0.264;1.066;3.821
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
8;100;18.608;5371;100000;99957;0.613;0.328;1.479;4.897
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
11;100;26.071;3833;100000;99957;0.632;0.294;1.652;4.729

I.e. 6600 docs per second at concurrency 5, almost 90% better than the initial throughput, which seems like a good result. By playing with the number of indexes and the concurrency you can find the optimal settings for your case.
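As a sketch, the two-shard setup might look like this in sphinx.conf (index names, paths, and fields here are illustrative assumptions, not taken from the benchmark):

```
index rt_shard0
{
    type         = rt
    path         = /var/lib/sphinx/rt_shard0
    rt_field     = content
    rt_attr_uint = gid
}

index rt_shard1
{
    type         = rt
    path         = /var/lib/sphinx/rt_shard1
    rt_field     = content
    rt_attr_uint = gid
}

# Searches go through the combined index; writes must target
# the shard indexes directly, since a distributed index is read-only.
index rt_all
{
    type  = distributed
    local = rt_shard0
    local = rt_shard1
}
```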


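Since writes have to go directly to the individual RT indexes rather than the distributed one, the writer needs a deterministic routing rule so that a later REPLACE for the same id lands in the same shard. A minimal sketch (the modulo rule and the `rt_shard` naming are assumptions for illustration):

```python
def shard_for(doc_id, num_shards=2, prefix="rt_shard"):
    """Route a document to one of num_shards RT indexes by id modulo.

    Deterministic routing keeps each id in exactly one shard, so
    updates and deletes for that id always hit the same index.
    """
    return "%s%d" % (prefix, doc_id % num_shards)
```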

Source: https://stackoverflow.com/questions/52914989/sphinx-rt-index-updating-in-parallel
