Solr accent removal

Submitted by 非 Y 不嫁゛ on 2019-12-22 00:29:18

Question


I have read various threads about how to remove accents at index and query time. The fieldType I have come up with so far looks like this:

<fieldType name="text_general" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

After adding some test data to the index, I checked which tokens had been generated via http://localhost:8080/solr/test_core/admin/luke?fl=title

For instance, a title like "Bayern München" has been tokenized into:

<int name="bayern">1</int>
<int name="m">1</int>
<int name="nchen">1</int>

So instead of replacing the character with its ASCII equivalent, the analyzer seems to have interpreted it as a delimiter. With an index like that, I can search for neither "münchen" nor "m?nchen".

Any idea how to fix this? Thanks in advance.


Answer 1:


The issue is that you apply StandardTokenizerFactory before ASCIIFoldingFilterFactory. Instead, apply the MappingCharFilterFactory character filter first and then the StandardTokenizerFactory.

According to the Solr Reference Guide, StandardTokenizerFactory supports the token types <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>. Therefore, when you tokenize with StandardTokenizerFactory, the umlaut characters are lost, and the ASCIIFoldingFilterFactory that runs afterwards has nothing left to fold.

If you want to keep StandardTokenizerFactory, your fieldType should look like this:

<fieldType name="text_general" class="solr.TextField">
    <analyzer>
        <!-- Character filters run on the raw text, before the tokenizer -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

The mapping-ISOLatin1Accent.txt file should contain the mappings for such "special" characters. In Solr this file comes pre-populated by default, for example ü => u and ä => a.
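
As a variation on the field type above, the following sketch keeps ASCIIFoldingFilterFactory as a fallback after the character filter, so that accented characters missing from the mapping file may still be folded at the token level. The name text_folded is hypothetical and not part of the original question or answer; whether the extra filter is needed depends on your data, so treat this as an untested starting point rather than a definitive configuration.

```xml
<!-- Hypothetical field type name; adjust to your schema -->
<fieldType name="text_folded" class="solr.TextField">
    <!-- A single <analyzer> without a type attribute is used at both index and query time -->
    <analyzer>
        <!-- Map accented characters on the raw text, before tokenization -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Fallback: fold any remaining non-ASCII characters in the tokens -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
</fieldType>
```

Because the analyzer changes what is stored in the index, existing documents would need to be re-indexed before the new tokens show up in the Luke output.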



Source: https://stackoverflow.com/questions/17162163/solr-accent-removal
