Question
To achieve some degree of fault tolerance with Solr, I have started to use the NGramFilterFactory. Here are the interesting bits from the schema.xml:
<field name="text" type="text" indexed="true" stored="true"/>
<copyField source="text" dest="text_ngram" />
<field name="text_ngram" type="text_ngram" indexed="true" stored="false"/>
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3" />
  </analyzer>
</fieldType>
I am using the EDisMax query parser with pretty much the stock configuration. Here are the interesting lines from the solrconfig.xml:
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Query settings -->
    <str name="defType">edismax</str>
    <str name="qf">
      name name_ngram^0.001
    </str>
    <str name="mm">100%</str>
    <str name="q.op">AND</str>
    ...
This works fine, but it gives me lots of irrelevant results. Using Solr's analysis capabilities, I think I've tracked the issue down to the following cause:
The query is broken down into NGrams. Solr then searches for either the tokenized query in the text field or one of the NGrams in the text_ngram field. Using debug=query prints the following parsedquery when searching for "something":
(+DisjunctionMaxQuery(((text_ngram:som text_ngram:ome text_ngram:met text_ngram:eth text_ngram:thi text_ngram:hin text_ngram:ing) | text:something)))/no_coord
If I read this right, it means that a document is returned if either
- at least one of the NGrams matches, or
- the original (tokenized) query matches.
This will also find items like "ethernet", because one of the NGrams (eth) is the same.
My question is: How can I set a higher threshold for the NGram matches? Is there a way to say "only return the item if at least 90% of the NGrams from the query match"? Making sure that 100% of the NGrams match would not make sense as this would effectively kill the fault tolerance.
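To make the threshold idea concrete, here is a minimal Python sketch of the matching rule being asked for. It is not Solr code and does not show how Solr scores anything. The helper names `ngrams` and `ngram_match_ratio` are made up for illustration, and the comparison runs on plain in-memory strings. It only mimics the trigram split from the schema above (minGramSize and maxGramSize both 3 over the whole value).

```python
def ngrams(text, n=3):
    """Return the character n-grams of text in order (n=3 mirrors the schema above)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_match_ratio(query, candidate, n=3):
    """Fraction of the query's distinct n-grams that also occur in the candidate."""
    query_grams = set(ngrams(query, n))
    if not query_grams:
        return 0.0  # query shorter than n: nothing to compare
    candidate_grams = set(ngrams(candidate, n))
    return len(query_grams & candidate_grams) / len(query_grams)


# Hypothetical threshold taken from the question ("at least 90%").
THRESHOLD = 0.9

for candidate in ("something", "ethernet"):
    ratio = ngram_match_ratio("something", candidate)
    print(candidate, round(ratio, 2), ratio >= THRESHOLD)
```

Under this rule "ethernet" shares only eth with "something" (1 of 7 NGrams) and would be dropped. Note that a single typo in the middle of a word changes several trigrams at once: the query "someting" shares only 4 of its 6 trigrams with "something", so a 90% threshold would likely be too strict for that case.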
Another way I thought of was to return only results that are above a certain score threshold relative to the top result. This is because the item "something" will have a very high relevancy compared to "ethernet". So is there a way to hook into Solr to return only results that have, e.g., at least 1/100th of the score of the top result? I read that there is a way to provide a custom HitCollector, but I couldn't really find any info on this.
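The relative-score idea can be sketched as a client-side post-filter. This is not a custom HitCollector and does not hook into Solr. It assumes the results were requested with the score included (for example fl=id,score) and already parsed into (id, score) pairs. The function name `filter_by_relative_score` and the sample scores are made up for illustration.

```python
def filter_by_relative_score(hits, min_fraction=0.01):
    """Keep hits whose score is at least min_fraction of the best score.

    hits: list of (doc_id, score) pairs, assumed to be parsed from a
    search response that included the score for each document.
    """
    if not hits:
        return []
    top_score = max(score for _, score in hits)
    cutoff = top_score * min_fraction
    return [(doc_id, score) for doc_id, score in hits if score >= cutoff]


# Made-up scores, only to show the cutoff in action.
hits = [("something", 12.0), ("somethin", 3.4), ("ethernet", 0.05)]
print(filter_by_relative_score(hits))
```

Because the filter runs after the response arrives, it only sees the rows that were fetched. Result counts and paging reported by the server would not reflect the documents dropped here.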
Thanks!
Answer 1:
The idea was to achieve some kind of fault tolerant search. When someone searches for "someting" it should find "something"
Solr's SpellChecker does fuzzy search, and you can set thresholds on it: http://wiki.apache.org/solr/SpellCheckComponent
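As a rough illustration of what such a request might look like, the Python sketch below builds the query string for a search with spellchecking switched on. It assumes the request handler being queried already has the SpellCheckComponent configured with a dictionary. The helper name `spellcheck_params` and the accuracy and count values are illustrative, not recommendations.

```python
from urllib.parse import urlencode


def spellcheck_params(user_query, accuracy=0.7, count=5):
    """Build a query string that asks Solr for spelling suggestions.

    accuracy and count are illustrative values; suitable settings depend
    on the index and on the configured spellchecker.
    """
    params = {
        "q": user_query,
        "spellcheck": "true",             # enable the spellcheck component
        "spellcheck.count": count,        # max suggestions per term
        "spellcheck.accuracy": accuracy,  # minimum accuracy for a suggestion (0 to 1)
        "spellcheck.collate": "true",     # ask for a rewritten query
    }
    return urlencode(params)


print(spellcheck_params("someting"))
```

The resulting string would be appended to the URL of a handler that has the component registered. The suggestions come back in a separate section of the response, next to the normal results.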
Source: https://stackoverflow.com/questions/17402479/return-only-results-that-match-enough-ngrams-with-solr