SOLR and accented characters

Question

I have an index for occupations (identifier + occupation):

<field name="occ_id" type="int" indexed="true" stored="true" required="true" />
<field name="occ_tx_name" type="text_es" indexed="true" stored="true" multiValued="false" />


<!-- Spanish -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

This is a real query, for three identifiers (1, 195 and 129):

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_id:1+occ_id:195+occ_id:129&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"occ_id:1 occ_id:195 occ_id:129",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "occ_id":1,
        "occ_tx_name":"Abogado",
        "_version_":1565225103805906944},
      {
        "occ_id":129,
        "occ_tx_name":"Informático",
        "_version_":1565225103843655680},
      {
        "occ_id":195,
        "occ_tx_name":"Osteópata",
        "_version_":1565225103858335746}]
  }}

Two of them have accented characters and one does not. So let's search by occ_tx_name without the accents:

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:abogado&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"occ_tx_name:abogado",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "occ_id":1,
        "occ_tx_name":"Abogado",
        "_version_":1565225103805906944}]
  }}

curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:informatico&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"occ_tx_name:informatico",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "occ_id":129,
        "occ_tx_name":"Informático",
        "_version_":1565225103843655680}]
  }}


curl -X GET "http://192.168.1.11:8983/solr/cyp_occupations/select?indent=on&q=occ_tx_name:osteopata&wt=json"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"occ_tx_name:osteopata",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

I am very annoyed that the last search, 'osteopata', fails while 'informatico' succeeds. The source data for the index is a simple MySQL table:

-- -----------------------------------------------------
-- Table `mydb`.`occ_occupation`
-- -----------------------------------------------------
CREATE TABLE IF NOT EXISTS `mydb`.`occ_occupation` (
  `occ_id` INT UNSIGNED NOT NULL,
  `occ_tx_name` VARCHAR(255) NOT NULL,
  PRIMARY KEY (`occ_id`))
ENGINE = InnoDB;

The collation of the table is “utf8mb4_general_ci”. The index is created with DataImportHandler. This is the definition:

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://192.168.1.11:3306/mydb"
        user="mydb" password="mydb" />
    <document name="occupations">
        <entity name="occupation" pk="occ_id"
            query="SELECT occ.occ_id, occ.occ_tx_name FROM occ_occupation occ WHERE occ.sta_bo_deleted = false">
            <field column="occ_id" name="occ_id" />
            <field column="occ_tx_name" name="occ_tx_name" />
        </entity>
    </document>
</dataConfig>

I need a clue to track down the problem. Can anyone help? Thanks in advance.

Answer 1:

I don't think MySQL or your JVM settings have anything to do with this. I suspect that one works and the other does not because of the SpanishLightStemFilterFactory.

The right way to match regardless of diacritics is to use the following:

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

Put that before your tokenizer in both the index and query analyzer chains, and any diacritic covered by the mapping file should be converted to its ASCII version. That should make matching work regardless of accents.
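
To show the idea of mapping characters before tokenizing, here is a rough Python sketch. It is not Solr code: ACCENT_MAP is a hypothetical, tiny subset of what a mapping file would contain, and a plain whitespace split stands in for the real tokenizer.

```python
from typing import List

# Hypothetical subset of an accent mapping table (assumption: the real
# mapping file contains many more entries than these).
ACCENT_MAP = str.maketrans({
    "\u00e1": "a", "\u00e9": "e", "\u00ed": "i", "\u00f3": "o", "\u00fa": "u",
    "\u00c1": "A", "\u00c9": "E", "\u00cd": "I", "\u00d3": "O", "\u00da": "U",
})

def analyze(text: str) -> List[str]:
    """Map accented characters first, then tokenize and lowercase."""
    mapped = text.translate(ACCENT_MAP)              # char filter step
    return [tok.lower() for tok in mapped.split()]   # tokenizer + lowercase step

# The same function is applied to the indexed value and to the query,
# so both end up as the same term.
print(analyze("Inform\u00e1tico"))
print(analyze("informatico"))
```

Because the mapping runs on both sides, the accented document value and the unaccented query produce the same token. A character that is not in the table passes through unchanged.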

Answer 2:

Just add solr.ASCIIFoldingFilterFactory to your analyzer filter chain or, even better, create a new fieldType:

<!-- Spanish -->
<fieldType name="text_es_ascii_folding" class="solr.TextField" positionIncrementGap="100">
  <analyzer> 
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_es.txt" format="snowball" />
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>

This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 128 characters, i.e. ASCII) to their ASCII equivalents, if one exists.

This should let you match the search even if the accented character is missing. The downside is that words like "cañon" and "canon" become equivalent and both hit the same documents, IIRC.
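
To illustrate that trade-off, here is a minimal Python sketch. It is only an approximation of the effect, not Solr's actual filter (which uses its own mapping table), and fold_accents is a made-up helper name. It strips accents by decomposing each character and dropping the combining marks:

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "ca\u00f1on" is "cañon"; after folding it is indistinguishable from "canon".
print(fold_accents("ca\u00f1on"))
print(fold_accents("ca\u00f1on") == fold_accents("canon"))
print(fold_accents("oste\u00f3pata"))
```

Once both words fold to the same term, a query for either one matches documents containing the other.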

Answer 3:

OK, I have found the source of the problem. I opened my SQL load script with vi in hex mode.

This is the hex content for 'Agrónomo' in an INSERT statement: 41 67 72 6f cc 81 6e 6f 6d 6f.

6f cc 81! That is the UTF-8 encoding of "o" followed by U+0301 COMBINING ACUTE ACCENT!
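
The same check can be done without a hex editor. A minimal Python sketch (the helper names utf8_hex and has_combining_marks are just illustrative) that prints the UTF-8 bytes of a string and reports whether it contains combining characters:

```python
import unicodedata

def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

def has_combining_marks(text: str) -> bool:
    """True if text contains any combining character (such as U+0301)."""
    return any(unicodedata.combining(ch) for ch in text)

decomposed = "Agro\u0301nomo"   # 'o' followed by COMBINING ACUTE ACCENT
precomposed = "Agr\u00f3nomo"   # single code point U+00F3

print(utf8_hex(decomposed))
print(utf8_hex(precomposed))
print(has_combining_marks(decomposed), has_combining_marks(precomposed))
```

Both strings look identical on screen, but only the first one contains the "6f cc 81" sequence.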

So that's the problem: it should be "c3 b3" (the single precomposed character ó, U+00F3). I got the literals by copy/pasting from a web page, so the characters in the original source were the problem.
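
One possible way to repair such data before loading it is Unicode NFC normalization, which composes a base letter plus combining accent into the single precomposed character. This is a sketch, not what I actually ran; it assumes the load script can be preprocessed with Python and that 'Osteópata' was stored in the same decomposed form:

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)

decomposed = "Osteo\u0301pata"   # assumed form of the bad literal
fixed = to_nfc(decomposed)

print(len(decomposed), len(fixed))       # one code point fewer after composing
print(fixed == "Oste\u00f3pata")
print(fixed.encode("utf-8").hex())
```

After normalization the accented letter is encoded as "c3 b3".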

Thanks to both of you; I have learned more about SOLR's soul.

Regards.

Source: https://stackoverflow.com/questions/43529068/solr-and-accented-characters

Tags

solr

accent-insensitive