elasticsearch tokenize “H&R Blocks” as “H”, “R”, “H&R”, “Blocks”

Submitted by 落花浮王杯 on 2019-12-11 02:03:27

Question


I want to preserve the special character in a token while still splitting on special characters. Say I have the phrase

"H&R Blocks"

I want to tokenize it as

"H", "R", "H&R", "Blocks"

I read this post: http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html. It explains how to preserve the special character, but not how to also emit the split tokens.


Answer 1:


Try using the word_delimiter token filter.

Reading the docs on its use, you can set the parameter preserve_original: true to do exactly what you want (i.e. "H&R" => H&R, H, R).

I would set it up like this:

"settings" : {
    "analysis" : {
        "filter" : {
            "special_character_spliter" : {
                "type" : "word_delimiter",
                "preserve_original": "true"
            }   
        },
        "analyzer" : {
            "your_analyzer" : {
                "type" : "custom",
                "tokenizer" : "whitespace",
                "filter" : ["lowercase", "special_character_spliter"]
            }
        }
    }
}
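To see why this produces the desired tokens, here is a minimal Python sketch of what word_delimiter with preserve_original does to each whitespace-separated token. It is an approximation for illustration only (the function name is mine, and the real filter also handles case and letter/digit transitions, which this ignores):

```python
import re

def word_delimiter(token, preserve_original=True):
    """Rough sketch of Elasticsearch's word_delimiter filter: split a
    token on non-alphanumeric characters, optionally keeping the
    original token alongside its parts."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    # preserve_original keeps the unsplit token in addition to the parts
    if preserve_original and len(parts) > 1:
        return [token] + parts
    return parts

print(word_delimiter("H&R"))     # -> ['H&R', 'H', 'R']
print(word_delimiter("Blocks"))  # -> ['Blocks']
```

Applied to the whitespace-tokenized input "H&R Blocks", this yields exactly the four tokens asked for: H&R, H, R, Blocks.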

Good luck!




Answer 2:


"settings" : { 
   "analysis" : {
       "filter" : {
           "blocks_filter" : {
               "type" : "word_delimiter",
               "preserve_original": "true"
           },
          "shingle":{
              "type":"shingle",
              "max_shingle_size":5,
              "min_shingle_size":2,
              "output_unigrams":"true"
           },
           "filter_stop":{
              "type":"stop",
              "enable_position_increments":"false"
           }
       },
       "analyzer" : {
           "blocks_analyzer" : {
               "type" : "custom",
               "tokenizer" : "whitespace",
               "filter" : ["lowercase", "blocks_filter", "shingle"]
           }
       }
   }
},
"mappings" : {
   "type" : {
       "properties" : {
           "company" : {
               "type" : "string",
               "analyzer" : "blocks_analyzer"
           }
       }
   }
}
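This answer layers a shingle filter on top of word_delimiter, so adjacent tokens are also combined into multi-word tokens. A rough Python sketch of the shingling step alone, under the same settings as above (min 2, max 5, unigrams kept); the function name is mine, and the real filter also interleaves shingles with the word_delimiter output by position, which this ignores:

```python
def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Rough sketch of Elasticsearch's shingle filter: emit word n-grams
    built from runs of adjacent tokens, joined by a space."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

# After lowercasing and whitespace tokenizing "H&R Blocks":
print(shingles(["h&r", "blocks"]))  # -> ['h&r', 'blocks', 'h&r blocks']
```

The extra "h&r blocks" shingle lets the full company name match as a single unit, on top of the individual tokens from the first answer.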


Source: https://stackoverflow.com/questions/18223101/elasticsearch-tokenize-hr-blocks-as-h-r-hr-blocks
