Question
I have a use case for Lucene in which the search types required are very simple. I'll likely use DOCS_ONLY indexing with no stored fields or any complicated add-ons. The documents are unstructured English text.
For this use case the most important thing to optimize is the compression ratio of the original documents to the on-disk size of the index. The Lucene index should be as small as possible, even at the expense of increased search and update latency.
I'm wondering how I should configure Lucene (any version) to accomplish this. In particular, what codec should be used? Is there one that emphasizes compression over search speed? Are there any other settings I can tweak that will optimize postings list compression?
tl;dr: For DOCS_ONLY indexing in Lucene, how can I make the index as small as possible?
Answer 1:
In general, the key idea for reducing index size is to store as little as possible and index as little as possible.
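As a rough illustration of "store and index as little as possible", the sketch below configures a single text field that records document IDs only, keeps no stored value, no norms, and no term vectors. It assumes Lucene 5 or later (where the option is called IndexOptions.DOCS rather than DOCS_ONLY) with StandardAnalyzer on the classpath; the class name, the field name "body", and the temporary directory are made up for the example, and the final forceMerge(1) is an optional extra step that usually trims some per-segment overhead at the cost of a slow merge.

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical example class; assumes Lucene 5+.
class MinimalIndexSketch {
    public static void main(String[] args) throws Exception {
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS); // postings hold doc IDs only: no freqs, positions, offsets
        type.setTokenized(true);                 // run the analyzer over the text
        type.setStored(false);                   // do not keep the original text in the index
        type.setOmitNorms(true);                 // skip the per-document length norm for this field
        type.setStoreTermVectors(false);         // no term vectors
        type.freeze();                           // make the configuration immutable

        Path path = Files.createTempDirectory("minimal-index"); // illustrative location
        try (Directory dir = FSDirectory.open(path);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new Field("body", "some unstructured English text", type));
            writer.addDocument(doc);
            writer.forceMerge(1); // optional: collapse to one segment once indexing is finished
        }
    }
}
```

With this field type, phrase queries and length-normalized scoring are not available, which matches the "very simple search types" described in the question.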
A few questions come first before your question can be answered properly. For example, how big is your index, and how much do you expect it to grow? I ask because it is probably not worth your time to try to reduce the index size below some threshold.
I have seen people reduce index size by up to 40%-50% by indexing documents with SimpleAnalyzer instead of StandardAnalyzer (which generally takes more storage), but that affected search performance. You mentioned in your post that you can accept an increase in search time, but are you also ready to sacrifice the quality of the search results? This is a very important question. It is not worth the effort to reduce the size of the index further if you have already reached that threshold!
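To make the trade-off concrete: SimpleAnalyzer splits text at every non-letter character and lowercases the result, so digits and punctuation never reach the index. The following is only a plain-Java imitation of that behavior, not Lucene's own code (the class and method names are invented for the example), but it shows which tokens survive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for letter-only, lowercasing tokenization.
class LetterTokens {
    // Split on any non-letter character and lowercase each run of letters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // a non-letter ends the current token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());     // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, shrank, the, index, by]
        System.out.println(tokenize("Lucene 4.10 shrank the index by 40%"));
    }
}
```

The version number and the percentage disappear entirely, so a query for "4.10" could no longer match this document. Fewer distinct terms is one plausible reason for the smaller index, and it is also why search results change.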
There are other settings I have seen people change to reduce size. For example, according to the docs, Index.NO_NORMS (omitting norms) will save you one byte per document in the index for each indexed field. Some people also say that, to compress numerical data, you can change the base of the number that is indexed/stored in the index (I have never checked this myself).
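The base-change idea can be sketched with the JDK alone: writing a number in base 36 instead of base 10 yields a shorter term string. This is a minimal illustration under the assumption that numbers are indexed as plain text terms; the class and method names are hypothetical, and it says nothing about how much space Lucene would actually save after its own term compression.

```java
// Hypothetical helper for indexing numbers as shorter base-36 terms.
class RadixCodec {
    // Encode a number as a base-36 string (digits 0-9 followed by a-z).
    static String encode(long value) {
        return Long.toString(value, Character.MAX_RADIX);
    }

    // Decode a base-36 term back into the original number.
    static long decode(String term) {
        return Long.parseLong(term, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        long id = 1234567890L;
        String term = encode(id);
        // prints "kf12oi (6 chars instead of 10)"
        System.out.println(term + " (" + term.length() + " chars instead of "
                + Long.toString(id).length() + ")");
        System.out.println(decode(term) == id); // prints "true"
    }
}
```

Note that terms encoded this way no longer sort in numeric order as text, so this only suits exact-match lookups, and the query side has to apply the same encoding.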
Moreover, I think the following two posts on SO will be helpful for you:
(1) "SOLR index size reduction" (2) "How to reduce the size of a generated Lucene/Solr index?"
You can read this post too.
Source: https://stackoverflow.com/questions/40903569/optimize-lucene-for-compression-ratio