How to improve a single character PrefixQuery performance?

Posted by 这一生的挚爱 on 2019-12-20 03:41:22

Question


I have a RAMDirectory with 1.5 million documents, and I'm searching a single field with a PrefixQuery. When the search text is 3 or more characters long, the search is extremely fast: less than 20 milliseconds. But when the search text is shorter than 3 characters, the search can take as long as a full second.

Since it's an autocomplete feature and the user starts with one character (and some results are indeed only one character long), I cannot restrict the length of the search text.

The code is pretty much:

var symbolCodeTopDocs = searcher.Search(new PrefixQuery(new Term("SymbolCode", searchText)), 10);

The SymbolCode is a NOT_ANALYZED field. The Lucene.NET version is 3.0.3.

The example is simplified, and I might have to use a BooleanQuery to apply additional constraints in a real-world scenario.

How can I improve performance in this specific case? These single-character and two-character queries are bringing the server down.


Answer 1:


Consider removing stop words from your index if you haven't already.

To understand how stop words slow down a PrefixQuery, consider how it works: it is rewritten as a BooleanQuery that includes every term in the index beginning with the PrefixQuery's term. For example, a* becomes a OR and OR aardvark OR anchor OR ... So far this isn't bad, and it performs surprisingly well even with thousands of terms. The real drain comes when stop words like "a" and "and" are included, because they are likely to appear multiple times in every single document in your index. This creates far more work for the gathering, collecting, and scoring portion of the search, and thus slows things down.
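The expansion described above can be imitated without Lucene at all. The following is only a rough sketch in plain C#: the term dictionary, its document counts, and the names `Expand` and `PostingsToVisit` are all made up for illustration and are not part of Lucene.NET. It shows why a one-character prefix that pulls in very common terms has so much more work to do than a longer prefix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixExpansionSketch
{
    // Hypothetical term dictionary: term -> number of documents containing it.
    // The numbers are invented; they are not measurements from a real index.
    private static readonly SortedDictionary<string, int> TermDocFreq =
        new SortedDictionary<string, int>(StringComparer.Ordinal)
        {
            { "a", 1500000 },
            { "aardvark", 12 },
            { "anchor", 340 },
            { "and", 1400000 },
            { "bank", 900 },
        };

    // Lists every term that starts with the prefix, in sorted order,
    // the way a prefix rewrite enumerates matching terms.
    public static List<string> Expand(string prefix)
    {
        return TermDocFreq.Keys
            .Where(t => t.StartsWith(prefix, StringComparison.Ordinal))
            .ToList();
    }

    // Sums the document counts of the matched terms: a rough proxy for
    // how many postings must be gathered, collected, and scored.
    public static long PostingsToVisit(string prefix)
    {
        return Expand(prefix).Sum(t => (long)TermDocFreq[t]);
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" OR ", Expand("a"))); // a OR aardvark OR anchor OR and
        Console.WriteLine(PostingsToVisit("a"));             // 2900352
        Console.WriteLine(PostingsToVisit("anc"));           // 340
    }
}
```

In this toy data the two stop words account for almost all of the work for the prefix a, while the handful of rarer terms barely register.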

On a side note, I highly recommend not running the autocomplete search when the user has entered fewer than 2 or 3 characters, purely from a usability perspective. I can't imagine the results would be at all relevant. Imagine running a search for a*: there's no way to tell which results are more relevant. If you must display something to the user, then consider an n-gram approach like the one Jf Beaulac suggested in the comments.
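One possible reading of the n-gram suggestion is to precompute the short leading substrings ("edge" n-grams) of each value at index time. The sketch below is plain C# and only generates those substrings; the helper name `EdgePrefixes` and the sample value "MSFT" are assumptions made for illustration. In Lucene.NET each generated prefix could then be stored as an extra NOT_ANALYZED value in a separate field, so that a one- or two-character input is looked up with a single TermQuery instead of a PrefixQuery. That removes the term enumeration step, though it does not reduce how many documents match.

```csharp
using System;
using System.Collections.Generic;

public static class EdgeNGramSketch
{
    // Returns the leading substrings of value with lengths 1..maxLength,
    // stopping early when the value itself is shorter than maxLength.
    public static List<string> EdgePrefixes(string value, int maxLength)
    {
        var prefixes = new List<string>();
        int limit = Math.Min(maxLength, value.Length);
        for (int length = 1; length <= limit; length++)
        {
            prefixes.Add(value.Substring(0, length));
        }
        return prefixes;
    }

    public static void Main()
    {
        // "MSFT" is a made-up sample symbol code.
        Console.WriteLine(string.Join(", ", EdgePrefixes("MSFT", 2))); // M, MS
        Console.WriteLine(string.Join(", ", EdgePrefixes("A", 2)));    // A
    }
}
```

Capping the length at 2 keeps the extra field small, since inputs of 3 or more characters are already fast with the existing PrefixQuery.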



Source: https://stackoverflow.com/questions/14059985/how-to-improve-a-single-character-prefixquery-performance
