How to configure stemming in Solr?

Submitted by 不问归期 on 2020-01-10 19:35:32

Question


I added "American" to my Solr index. When I search for "America", there are no results.

How should schema.xml be configured to get results?

Current configuration:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>

Answer 1:


Why would you have two stemmers?
Try removing EnglishPorterFilterFactory (deprecated) from both of your analyzer types, rebuild the index, and then check whether a search for "America" matches "American".
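
For illustration, here is a rough sketch of how the index analyzer from the question might look with only one stemmer left; the query analyzer would change the same way. Two things in it are assumptions that go beyond the suggestion above: the stemmer is moved ahead of RemoveDuplicatesTokenFilterFactory so that duplicates created by stemming are also dropped, and KeywordMarkerFilterFactory is used to keep honoring protwords.txt, which only applies to Solr versions that ship that filter.

```xml
<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- Assumption: marks the words in protwords.txt so the stemmer below skips them -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
    <!-- The only stemmer in the chain; EnglishPorterFilterFactory has been removed -->
    <filter class="solr.PorterStemFilterFactory" />
    <!-- Assumption: moved after the stemmer so stemmed duplicates are removed too -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
```

Documents indexed with the old chain keep their old tokens until they are reindexed.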

If that doesn't work, the other thing you can try is to remove both of your stemmer filters and add SnowballPorterFilterFactory with language="English" instead.
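
As a sketch of that second option, the single line below would stand in for both the EnglishPorterFilterFactory and the PorterStemFilterFactory lines in each analyzer of the question's fieldType. Carrying protwords.txt over through the protected attribute is an assumption about what the asker wants to keep; leave it out if no words need protecting.

```xml
<!-- Replaces both stemmer filters in the index analyzer and in the query analyzer -->
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
```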




Answer 2:


Use only one stemmer per analyzer. EnglishPorterFilterFactory is deprecated, as @Marko already mentioned, so remove it from both analyzers.

I used SnowballPorterFilterFactory for both the index and query analyzers:

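<fieldType name="text_stem" class="solr.TextField">
    <!-- An analyzer without a type attribute is used for both indexing and querying -->
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- language defaults to English; it is spelled out here for clarity -->
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <!-- other filters -->
    </analyzer>
</fieldType>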

The fieldType definition is fairly self-explanatory, but just in case:

  • Tokenizer solr.WhitespaceTokenizerFactory: This breaks the text into words (tokens), using whitespace as the delimiter.

  • Filter solr.SnowballPorterFilterFactory: This filter applies a stemming algorithm to each word (token). In the example above I have chosen the Snowball Porter stemming algorithm. Solr provides implementations of several popular stemming algorithms.

You can also look at the other stemming filters Solr offers, e.g. HunspellStemFilterFactory and KStemFilterFactory.
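
As a hedged illustration of swapping in one of those alternatives, the sketch below uses KStemFilterFactory in place of the Snowball filter. The names text_kstem and title_kstem are hypothetical, and the LowerCaseFilterFactory in front of the stemmer is an assumption on my part, based on stemmers generally expecting lowercased tokens.

```xml
<!-- Hypothetical field type using KStem as its single stemmer -->
<fieldType name="text_kstem" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Assumption: lowercase tokens before they reach the stemmer -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KStemFilterFactory"/>
    </analyzer>
</fieldType>

<!-- Hypothetical field that uses the type above -->
<field name="title_kstem" type="text_kstem" indexed="true" stored="true"/>
```

Whichever stemmer you pick, rebuild the index after changing the analyzer so indexed and query tokens are produced by the same chain.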



Source: https://stackoverflow.com/questions/5285916/how-to-configure-stemming-in-solr
