Question
I want to split on special characters while also preserving the original token that contains them. Say I have the term
"H&R Blocks"
I want it tokenized as
"H", "R", "H&R", "Blocks"
I read this post http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html , which explains how to preserve special characters, but not how to emit the split tokens as well.
Answer 1:
Try using the word_delimiter token filter. Reading its docs, you can set the parameter preserve_original: true to do exactly what you want (i.e. "H&R" => "H&R", "H", "R").
I would set it up like this:
"settings" : {
"analysis" : {
"filter" : {
"special_character_spliter" : {
"type" : "word_delimiter",
"preserve_original": "true"
}
},
"analyzer" : {
"your_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "special_character_spliter"]
}
}
}
}
Good luck!
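To see why this setup produces the tokens you asked for, here is a rough Python simulation of what preserve_original adds to the split (an illustrative sketch of the behavior, not Elasticsearch's actual implementation):

```python
import re

def word_delimiter(token, preserve_original=True):
    """Rough simulation of the word_delimiter token filter:
    split on non-alphanumeric characters, optionally keeping
    the original token alongside its parts."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if preserve_original and len(parts) > 1:
        return [token] + parts
    return parts

# After the whitespace tokenizer and lowercase filter,
# "H&R Blocks" becomes the tokens "h&r" and "blocks":
print(word_delimiter("h&r"))     # ['h&r', 'h', 'r']
print(word_delimiter("blocks"))  # ['blocks']
```

With preserve_original left at its default of false, only the split parts ('h', 'r') would be emitted, which is why the flag matters here.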
Answer 2:
"settings" : {
"analysis" : {
"filter" : {
"blocks_filter" : {
"type" : "word_delimiter",
"preserve_original": "true"
},
"shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"filter_stop":{
"type":"stop",
"enable_position_increments":"false"
}
},
"analyzer" : {
"blocks_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "blocks_filter", "shingle"]
}
}
}
},
"mappings" : {
"type" : {
"properties" : {
"company" : {
"type" : "string",
"analyzer" : "blocks_analyzer"
}
}
}
}
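The shingle filter in this answer additionally emits word n-grams (here of 2 to 5 words) on top of the single tokens, so phrases like "h&r blocks" become searchable as one token. A simplified Python sketch of that step (an approximation of the filter's output, ignoring position bookkeeping, not Elasticsearch's actual code):

```python
def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Sketch of the shingle token filter: emit word n-grams of
    min_size..max_size words joined by spaces, optionally keeping
    the single-word tokens (unigrams) as well."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

# With the two whitespace tokens of "h&r blocks":
print(shingles(["h&r", "blocks"]))  # ['h&r', 'blocks', 'h&r blocks']
```

Setting "output_unigrams": "true" in the config corresponds to output_unigrams=True here; without it, only the multi-word shingles would be indexed.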
Source: https://stackoverflow.com/questions/18223101/elasticsearch-tokenize-hr-blocks-as-h-r-hr-blocks