How to correctly configure Solr stemming

淺唱寂寞╮ posted on 2019-12-11 03:39:31

Question


I have configured a field in Solr as shown below. When I search for "Conditioner", I expected to also match documents containing "Conditioning". According to the Solr Analysis screen, however, PorterStemFilterFactory reduces "Conditioning" to "condit" at index time, while at query time "Conditioner" is stemmed to "condition", so the two never match.

How can I configure stemming so that both "Conditioner" and "Conditioning" stem to "condition"?

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" 
            generateWordParts="1" generateNumberParts="1" 
            catenateWords="1" catenateNumbers="1" catenateAll="0" 
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Answer 1:


I would also suggest trying a different stemmer. Solr ships with these four:

  1. solr.PorterStemFilterFactory
  2. solr.SnowballPorterFilterFactory
  3. solr.KStemFilterFactory
  4. solr.HunspellStemFilterFactory (this one needs a dictionary from an external source, such as OpenOffice)

The stemmers handle your example differently; see the results below. Given these results, and since it needs no external resource, I would opt for KStem. If you do not mind including a dictionary, I would go for Hunspell.

  1. porter
    • Conditioner -> condition
    • Conditioning -> condit
  2. snowballporter
    • Conditioner -> condition
    • Conditioning -> condit
  3. kstem
    • Conditioner -> condition
    • Conditioning -> condition
  4. hunspell with en_GB
    • Conditioner -> condition
    • Conditioning -> conditioning; condition
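
Based on the results above, switching to KStem would mean swapping the stemmer filter at the end of both analyzer chains. A minimal sketch, assuming the rest of the question's text_general field type stays unchanged (only the tail of each chain is shown):

```xml
<!-- Tail of both the index analyzer and the query analyzer of text_general.
     Same filters as in the question; PorterStemFilterFactory is replaced
     by KStemFilterFactory. -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.KStemFilterFactory"/>
```

Use the same stemmer in both analyzers, and reindex after changing the index analyzer, because terms already in the index were produced by the old stemmer.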



Answer 2:


If only this particular case is important, you could override the stemmer:

StemmerOverrideFilterFactory

If the Porter stemmer is generally too aggressive, then try another stemmer like KStem.
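
A minimal sketch of the override approach, assuming a hypothetical dictionary file named stemdict.txt placed in the same conf directory as the schema. The override filter has to come before the stemmer: tokens it rewrites are marked as keywords, so the Porter filter that follows leaves them alone.

```xml
<!-- stemdict.txt (hypothetical file name) holds one tab-separated pair per
     line: the word, then the stem to use. Entries are lowercase because the
     filter runs after LowerCaseFilterFactory. Example line:
       conditioning [TAB] condition
-->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="false"/>
<filter class="solr.PorterStemFilterFactory"/>
```

These lines would go into both the index and the query analyzer, followed by a reindex. The stem "condition" is chosen here because, per the question, that is what Porter produces for "Conditioner".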



Source: https://stackoverflow.com/questions/27516556/how-to-correctly-configure-solr-stemming
