I have read various threads about how to remove accents at index/query time. The current fieldType I have come up with looks like the following:
The issue is that you are applying StandardTokenizerFactory before applying the ASCIIFoldingFilterFactory. Instead, you should apply the MappingCharFilterFactory character filter first and then the StandardTokenizerFactory.
As per the Solr Reference Guide, StandardTokenizerFactory supports the token types <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>. Therefore, when you tokenize using StandardTokenizerFactory, the umlaut characters are lost and your ASCIIFoldingFilterFactory is of no use after that.
Your fieldType should look like the one below if you want to keep StandardTokenizerFactory.
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
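Since the question mentions both index and query time: a single <analyzer> element without a type attribute is used for both. If you ever need the two chains to differ, Solr also accepts separate analyzers marked type="index" and type="query". The following is only a sketch of that layout; the field type name text_folded is made up for illustration, and the two chains are kept identical so that folded index terms still match folded query terms.

```xml
<!-- Hypothetical field type name; both chains start with the same charFilter -->
<fieldType name="text_folded" class="solr.TextField">
  <!-- Applied when documents are indexed -->
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to the user's query text -->
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the character mapping were applied on only one side, an indexed term and the corresponding query term could end up in different forms and fail to match, so in most cases the single shared <analyzer> is the simpler choice.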
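The mapping-ISOLatin1Accent.txt file should contain the mappings for such "special" characters. Solr ships this file pre-populated in its example configurations. The stock file folds accented characters to their base letters (for example, ü -> u and ä -> a); if you want German-style transliterations such as ü -> ue or ä -> ae, add or change those entries yourself.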
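If the stock file does not contain the mapping you want, one option is to point the charFilter at a file of your own. The sketch below assumes a hypothetical file named mapping-custom.txt placed in the same conf directory as the schema; the file name and the field type name are illustrative, and the sample entries are shown inside an XML comment only to indicate the "source" => "target" line format the mapping file uses.

```xml
<!--
  Assumed contents of the hypothetical conf/mapping-custom.txt (saved as UTF-8):

    "\u00FC" => "ue"
    "\u00E4" => "ae"
    "\u00F6" => "oe"
    "\u00DF" => "ss"
-->
<fieldType name="text_custom_folded" class="solr.TextField">
  <analyzer>
    <!-- The char filter rewrites the raw text before the tokenizer sees it -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-custom.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the mapping file or the field type, the core needs to be reloaded and existing documents reindexed for the new folding to show up in the index.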