Sphinx RT-index updating in parallel

Submitted by 爱⌒轻易说出口 on 2020-01-16 14:16:06

Question


Since I can't find it anywhere, is it possible to update the RT-index of Sphinx in parallel?

For instance, I have noticed a drop in processing speed once documents exceed 1,000,000 words. I would therefore like to have my processor handle documents with over 1,000,000 words in a separate thread, so they don't hold back the smaller documents.

However, I haven't been able to find any benchmarks of updating the RT-index in parallel, nor any documentation on it.

Are there others who are using this approach or is it considered bad practice?


Answer 1:


First of all, let me remind you that when you update something in a Sphinx real-time index (the same applies to Manticore Search, Lucene, Solr, and Elasticsearch), you don't actually update anything in place: you just add the change to a new segment (a RAM chunk in the case of Sphinx), which will eventually (often much later) be merged with other segments, at which point the change is really applied. So the question is really how fast you can populate the RT RAM chunk with new records and how concurrency changes the throughput. I made a test based on https://github.com/Ivinco/stress-tester and here's what I got:

snikolaev@dev:~/stress_tester_github$ for conc in 1 2 5 8 11; do ./test.php --plugin=rt_insert.php -b=100 --data=/home/snikolaev/hacker_news_comments.smaller.csv -c=$conc --limit=100000 --csv; done;
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
1;100;28.258;3537;100000;99957;0.275;0.202;0.519;1.221
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
2;100;18.811;5313;100000;99957;0.34;0.227;0.673;2.038
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
5;100;16.751;5967;100000;99957;0.538;0.326;1.163;3.797
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
8;100;20.576;4857;100000;99957;0.739;0.483;1.679;5.527
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
11;100;23.55;4244;100000;99957;0.862;0.54;2.102;5.849

I.e. increasing concurrency from 1 to 11 (in my case, on an 8-core server) raises the throughput from about 3500 to about 4200 documents per second, roughly 20%. Not bad, but not that great a performance boost (note that in this run throughput actually peaked at almost 6000 docs/sec at concurrency 5 and degraded beyond that).
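For reference, here is a minimal sketch (not the stress-tester's actual code) of what such a concurrent batched-insert driver can look like in Python. The index name `rt`, the single `content` field, the port 9306, and the use of the pymysql client are all assumptions for illustration; Sphinx's searchd speaks the MySQL wire protocol, so any MySQL client works:

```python
import threading


def batch_insert_sql(index, docs, start_id=1):
    """Build one multi-row SphinxQL INSERT for a batch of documents.

    Batching (the -b=100 flag in the test above) matters: it costs one
    round trip per batch instead of one per document.
    """
    values = ", ".join(
        "(%d, '%s')" % (start_id + i, doc.replace("'", "\\'"))
        for i, doc in enumerate(docs)
    )
    return "INSERT INTO %s (id, content) VALUES %s" % (index, values)


def insert_worker(batch_sqls, host="127.0.0.1", port=9306):
    """Each worker holds its own connection and pushes its batches."""
    import pymysql  # assumption: pymysql is installed

    conn = pymysql.connect(host=host, port=port)
    with conn.cursor() as cur:
        for sql in batch_sqls:
            cur.execute(sql)
    conn.close()


def run_concurrent(batch_lists):
    """Start one writer thread per batch list (the -c concurrency knob)."""
    threads = [threading.Thread(target=insert_worker, args=(b,))
               for b in batch_lists]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The point of the sketch is only to show where the two knobs from the benchmark live: batch size in `batch_insert_sql` and concurrency in `run_concurrent`.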

In your case another approach may work better: instead of writing to a single index, write to multiple indexes and put a distributed index on top to combine them for searching, so-called sharding. For example, if you write to two RT indexes instead of one, you can get this:

snikolaev@dev:~/stress_tester_github$ for conc in 1 2 5 8 11; do ./test.php --plugin=rt_insert.php -b=100 --data=/home/snikolaev/hacker_news_comments.smaller.csv -c=$conc --limit=100000 --csv; done;
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
1;100;28.083;3559;100000;99957;0.274;0.206;0.514;1.223
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
2;100;18.03;5543;100000;99957;0.328;0.225;0.653;1.919
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
5;100;15.07;6633;100000;99957;0.475;0.264;1.066;3.821
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
8;100;18.608;5371;100000;99957;0.613;0.328;1.479;4.897
concurrency;batch size;total time;throughput;elements count;latencies count;avg latency, ms;median latency, ms;95p latency, ms;99p latency, ms
11;100;26.071;3833;100000;99957;0.632;0.294;1.652;4.729

I.e. 6600 docs per second at concurrency 5, almost 90% better than the initial throughput, which seems like a good result. By playing with the number of indexes and the concurrency you can find the optimal settings for your case.
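As a sketch, the two-shard setup might look like this in sphinx.conf (index names, paths, and fields here are illustrative assumptions, not taken from the benchmark):

```
index rt_shard0
{
    type         = rt
    path         = /var/lib/sphinx/rt_shard0
    rt_field     = content
    rt_attr_uint = gid
}

index rt_shard1
{
    type         = rt
    path         = /var/lib/sphinx/rt_shard1
    rt_field     = content
    rt_attr_uint = gid
}

# Searches go through the combined index; writes must target
# the shard indexes directly, since a distributed index is read-only.
index rt_all
{
    type  = distributed
    local = rt_shard0
    local = rt_shard1
}
```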


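Since writes have to go directly to the individual RT indexes rather than the distributed one, the writer needs a deterministic routing rule so that a later REPLACE for the same id lands in the same shard. A minimal sketch (the modulo rule and the `rt_shard` naming are assumptions for illustration):

```python
def shard_for(doc_id, num_shards=2, prefix="rt_shard"):
    """Route a document to one of num_shards RT indexes by id modulo.

    Deterministic routing keeps each id in exactly one shard, so
    updates and deletes for that id always hit the same index.
    """
    return "%s%d" % (prefix, doc_id % num_shards)
```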

Source: https://stackoverflow.com/questions/52914989/sphinx-rt-index-updating-in-parallel
