Question
I want to split on special characters while also preserving the original token that contains them. Say I have the term
"H&R Blocks"
I want it tokenized as
"H", "R", "H&R", "Blocks"
I read this post http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html , which explains how to preserve special characters, but not how to emit the split tokens as well.
Answer 1:
Try using the word_delimiter token filter. Reading its docs, you can set the parameter preserve_original: true to do exactly what you want (i.e. "H&R" => "H&R", "H", "R").
I would set it up like this:
"settings" : {
"analysis" : {
"filter" : {
"special_character_spliter" : {
"type" : "word_delimiter",
"preserve_original": "true"
}
},
"analyzer" : {
"your_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "special_character_spliter"]
}
}
}
}
Good luck!
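To see why this setup produces the tokens you asked for, here is a rough Python simulation of what preserve_original adds to the split (an illustrative sketch of the behavior, not Elasticsearch's actual implementation):

```python
import re

def word_delimiter(token, preserve_original=True):
    """Rough simulation of the word_delimiter token filter:
    split on non-alphanumeric characters, optionally keeping
    the original token alongside its parts."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if preserve_original and len(parts) > 1:
        return [token] + parts
    return parts

# After the whitespace tokenizer and lowercase filter,
# "H&R Blocks" becomes the tokens "h&r" and "blocks":
print(word_delimiter("h&r"))     # ['h&r', 'h', 'r']
print(word_delimiter("blocks"))  # ['blocks']
```

With preserve_original left at its default of false, only the split parts ('h', 'r') would be emitted, which is why the flag matters here.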
Answer 2:
"settings" : {
"analysis" : {
"filter" : {
"blocks_filter" : {
"type" : "word_delimiter",
"preserve_original": "true"
},
"shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"filter_stop":{
"type":"stop",
"enable_position_increments":"false"
}
},
"analyzer" : {
"blocks_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "blocks_filter", "shingle"]
}
}
}
},
"mappings" : {
"type" : {
"properties" : {
"company" : {
"type" : "string",
"analyzer" : "blocks_analyzer"
}
}
}
}
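The shingle filter in this answer additionally emits word n-grams (here of 2 to 5 words) on top of the single tokens, so phrases like "h&r blocks" become searchable as one token. A simplified Python sketch of that step (an approximation of the filter's output, ignoring position bookkeeping, not Elasticsearch's actual code):

```python
def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Sketch of the shingle token filter: emit word n-grams of
    min_size..max_size words joined by spaces, optionally keeping
    the single-word tokens (unigrams) as well."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

# With the two whitespace tokens of "h&r blocks":
print(shingles(["h&r", "blocks"]))  # ['h&r', 'blocks', 'h&r blocks']
```

Setting "output_unigrams": "true" in the config corresponds to output_unigrams=True here; without it, only the multi-word shingles would be indexed.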
Source: https://stackoverflow.com/questions/18223101/elasticsearch-tokenize-hr-blocks-as-h-r-hr-blocks