Solr - fuzzy search issue with PatternTokenizerFactory

Question

I'm using Solr 4.2 in my application. I changed my text field definition to use solr.PatternTokenizerFactory instead of solr.StandardTokenizerFactory; the schema definition now looks like this:

<fieldType name="text_token" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s:" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9&amp;\-']|\d{0,4}s:" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_extra_query.txt" enablePositionIncrements="false" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

After this change, fuzzy search no longer works the way it did before.

I'm searching with the term: worde~1

Previously this search returned around 300 records; now it returns only 5.

These 5 records contain words like WORD, WORDS, and WORSE, but other documents containing such words are not returned.

I'm not sure what the issue could be.

Can anybody help me make it work?

EDIT:

The regex splits on any character other than letters, digits, '&', '-', and the apostrophe, and also on ns: (where n is a number from 0 to 9999, e.g. 4323s:).

Let's say, for example, my text is as below.

this is nice* day & sun 53s: is risen.

Then the pattern tokenizer should create these tokens:

this is nice day & sun is risen (each word is a separate token)
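
To make the expected split concrete, here is a minimal sketch that applies the same regex to the sample text outside Solr. It is written in Python rather than Java and assumes that Python's `re` module treats this particular pattern the same way `java.util.regex` does, and that the tokenizer splits on every match and discards empty pieces. The helper name `split_tokens` is hypothetical, not part of Solr.

```python
import re

# The pattern from the schema, with the XML entity &amp; unescaped to '&'.
PATTERN = re.compile(r"[^a-zA-Z0-9&\-']|\d{0,4}s:")


def split_tokens(text):
    """Split on every match of PATTERN and drop empty pieces.

    This only imitates the split step; it does not apply the stop-word,
    synonym, or lowercase filters that follow in the analyzer chain.
    """
    return [piece for piece in PATTERN.split(text) if piece]


sample = "this is nice* day & sun 53s: is risen."
print(split_tokens(sample))
# ['this', 'is', 'nice', 'day', '&', 'sun', 'is', 'risen']
```

Because `\d{0,4}` also allows zero digits, under the same assumptions a bare `s:` is treated as a delimiter too, so a string such as `words: x` would split into `word` and `x`.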

The pattern seems to work fine with different text.

Also, for the fuzzy search worde~1, I checked the results returned with PatternTokenizerFactory: they include words followed by punctuation marks, such as 'WORDS,' and 'WORDED....'.

One more odd thing: all the results are in uppercase and none are in lowercase, although not every uppercase match is returned either.

Answer 1:

I don't think there is much to change in the analyzer, because it is already working as expected. There seems to be nothing wrong with how it applies the tokenizer and filters during indexing and querying.

So, assuming the analyzer is fine, I think the way you perform the fuzzy search needs some modification.

The number you put after the ~ in the search query controls how strict the fuzzy match is.

"Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1, with a value closer to 1 only terms with a higher similarity will be matched."

My suggestion would be to decrease this value to get more search results. By trial and error, you can arrive at the level of similarity that fits your requirement.
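
For context on what the number after ~ measures: the passage quoted above describes the older 0 to 1 similarity form, while the Lucene 4.x query parser documentation describes a whole number after ~ as a maximum number of edits (0 to 2). The sketch below is only an illustration of that edit-count reading, not Lucene's implementation. It uses plain Levenshtein distance, whereas Lucene's fuzzy matching also counts a transposition as a single edit by default, and the list of terms is hypothetical, lowercased the way LowerCaseFilterFactory would store them.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


query = "worde"
# Hypothetical indexed terms, already lowercased.
terms = ["word", "words", "worse", "worded", "world"]
for term in terms:
    print(term, edit_distance(query, term))
```

Under this reading, a term is a candidate for worde~1 only if its distance from the query is at most 1. Comparing distances this way against the terms actually stored in the index can help when tuning the value by trial and error.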

Source: https://stackoverflow.com/questions/16105450/solr-fuzzy-search-issue-with-patterntokenizer-factory

Tags

solr

fuzzy-search