SOLR and accented characters

生来不讨喜 2021-01-27 07:12

I have an index for occupations (identifier + occupation):




        
3 Answers
  • 2021-01-27 07:32

    I don't think MySQL or your JVM settings have anything to do with this. I suspect one query works and the other does not, probably because of the SpanishLightStemFilterFactory.

    The right way to get matches regardless of diacritics is to use the following:

      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    

    Put that before your tokenizer in both the index and query analyzer chains, and any diacritic will be converted to its ASCII version. That should make matching work consistently.
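
    For illustration, here is a minimal sketch of a fieldType with the char filter placed before the tokenizer in both analyzers (the fieldType name and the downstream filters are assumptions; adapt them to your existing schema):

      <fieldType name="text_es_folded" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <!-- Fold accented characters to ASCII before tokenization -->
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SpanishLightStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <!-- Apply the same mapping at query time so both sides see unaccented text -->
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SpanishLightStemFilterFactory"/>
        </analyzer>
      </fieldType>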

  • 2021-01-27 07:34

    OK, I have found the source of the problem. I opened my SQL load script in vi, in hex mode.

    This is the hex content for 'Agrónomo' in an INSERT statement: 41 67 72 6f cc 81 6e 6f 6d 6f.

    6f cc 81!!! That is a plain "o" followed by cc 81, the UTF-8 encoding of U+0301 COMBINING ACUTE ACCENT (the decomposed form).
    

    So that's the problem... It should be "c3 b3", the precomposed "ó" (U+00F3). I got the literals by copy/pasting from a web page, so the source characters were the problem.
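
    For reference, this kind of decomposed input can also be normalized inside Solr. A hedged sketch, assuming the ICU analysis module (analysis-extras) is available: an ICU normalizer char filter can compose "o" + combining acute into the precomposed "ó" before tokenization, so both encodings index and match identically.

      <analyzer>
        <!-- Compose combining marks (decomposed/NFD input) into precomposed characters (NFC) -->
        <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfc" mode="compose"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>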

    Thanks to both of you; I have learned more about SOLR's soul.

    Regards.

  • 2021-01-27 07:36

    Just add solr.ASCIIFoldingFilterFactory to your analyzer's filter chain, or even better, create a new fieldType:

    <!-- Spanish -->
    <fieldType name="text_es_ascii_folding" class="solr.TextField" positionIncrementGap="100">
      <analyzer> 
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
        <filter class="solr.SpanishLightStemFilterFactory"/>
      </analyzer>
    </fieldType>
    

    This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists.

    This should let the search match even when the accented character is missing. The downside is that words like "cañón" and "canon" become equivalent and, IIRC, both hit the same documents.
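
    To use it, the schema would point the relevant field at the new type; a sketch, with field names assumed from the question:

      <field name="id" type="string" indexed="true" stored="true" required="true"/>
      <field name="occupation" type="text_es_ascii_folding" indexed="true" stored="true"/>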
