Rails sunspot-solr - words with hyphen

I'm using the sunspot_rails gem and everything is working perfectly so far, with one exception: I'm not getting any search results for words that contain a hyphen.

Example: The string "tron" returns a lot of results (the word mentioned in all articles is "e-tron").

The string "e-tron" returns 0 results, even though this is the exact word used in all my articles.

My current schema.xml config:

    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

What I want: the behaviour for the search string "tron" is fine as it is, but I also want correct matches for the search string "e-tron".

polmiro

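The problem is that solr.StandardTokenizerFactory splits words on hyphens, so "e-tron" generates the tokens "e" and "tron". Presumably "e" is then lost at index time, because the solr.EdgeNGramFilterFactory in your index analyzer is configured with minGramSize="2" and produces no grams for a one-character token. The query for "e-tron" therefore looks for a token "e" that was never indexed.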
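
To make the token loss concrete, here is a rough Ruby model of the original analyzer chain. It is only a sketch: the helper names (standard_tokenize, edge_ngrams, index_tokens, query_tokens) are hypothetical, the regex is a crude stand-in for the real tokenizer, and none of this is Solr's actual implementation.

```ruby
# Simplified model of the ORIGINAL schema.xml chain. Illustrative only.

# Crude stand-in for StandardTokenizerFactory (assumption):
# split on any run of characters that are not letters or digits.
def standard_tokenize(text)
  text.split(/[^[:alnum:]]+/).reject(&:empty?)
end

# Stand-in for EdgeNGramFilterFactory with side="front":
# tokens shorter than min produce no grams at all.
def edge_ngrams(token, min: 2, max: 15)
  return [] if token.length < min
  upper = [token.length, max].min
  (min..upper).map { |n| token[0, n] }
end

# Index analyzer: tokenize, lowercase, then edge n-grams.
def index_tokens(text)
  standard_tokenize(text).map(&:downcase).flat_map { |t| edge_ngrams(t) }
end

# Query analyzer: tokenize and lowercase only.
def query_tokens(text)
  standard_tokenize(text).map(&:downcase)
end

indexed = index_tokens("e-tron")
queried = query_tokens("e-tron")
missing = queried - indexed

p indexed  # => ["tr", "tro", "tron"]
p queried  # => ["e", "tron"]
p missing  # => ["e"]
```

In this model the query side asks for "e", which the index side never stored. Whether that alone yields zero results depends on how the query parser combines the terms, but it is consistent with what you are seeing.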

Here is one example configuration that should handle your specific case.

    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

solr.WhitespaceTokenizerFactory splits only on whitespace, so "e-tron" stays a single token: ["e-tron"]
solr.WordDelimiterFilterFactory then splits on hyphens, and preserveOriginal="1" keeps the original word as well: ["e", "tron", "e-tron"]
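
The same kind of rough Ruby model can be applied to the suggested chain. Again this is a sketch under assumptions: the helper names are hypothetical, and word_delimiter only handles hyphens, whereas the real solr.WordDelimiterFilterFactory also splits on other delimiters and, by default, on case changes.

```ruby
# Simplified model of the SUGGESTED schema.xml chain. Illustrative only.

# Stand-in for WhitespaceTokenizerFactory: split on whitespace only.
def whitespace_tokenize(text)
  text.split
end

# Stand-in for WordDelimiterFilterFactory with preserveOriginal="1"
# (assumption: only hyphens are modelled as delimiters).
def word_delimiter(token)
  parts = token.split("-").reject(&:empty?)
  parts.length > 1 ? parts + [token] : [token]
end

# Stand-in for EdgeNGramFilterFactory with side="front".
def edge_ngrams(token, min: 2, max: 15)
  return [] if token.length < min
  upper = [token.length, max].min
  (min..upper).map { |n| token[0, n] }
end

# Index analyzer: whitespace, word delimiter, lowercase, edge n-grams.
def index_tokens(text)
  whitespace_tokenize(text)
    .flat_map { |t| word_delimiter(t) }
    .map(&:downcase)
    .flat_map { |t| edge_ngrams(t) }
end

# Query analyzer: whitespace, word delimiter, lowercase.
def query_tokens(text)
  whitespace_tokenize(text).flat_map { |t| word_delimiter(t) }.map(&:downcase)
end

indexed = index_tokens("e-tron")
queried = query_tokens("e-tron")
matched = queried & indexed

p queried  # => ["e", "tron", "e-tron"]
p matched  # => ["tron", "e-tron"]
```

In this model the full word "e-tron" is now present on both the index and the query side, while the lone "e" is still dropped by minGramSize="2". A plain "tron" query keeps matching as before.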

Source: https://stackoverflow.com/questions/17225344/rails-sunspot-solr-words-with-hyphen

Tags

ruby-on-rails

n-gram

sunspot-solr