How to combine a certain set of words into a token in Elasticsearch?

Submitted by 孤街醉人 on 2020-01-25 09:41:27

Question


For a string like "This is a beautiful day", I want to tokenize it into the tokens "This, is, a, beautiful, day, beautiful day", where I can specify a certain set of words to combine; in this case only "beautiful" and "day".

So far, I have used the shingle filter to produce a token list like the one below: "This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day"

How can I further filter the token list above to produce my desired result?

Here is my current code:

shingle_filter = {
    "type": "shingle",
    "min_shingle_size": 2,    # emit shingles of 2 to 3 consecutive tokens
    "max_shingle_size": 3,
    "token_separator": " "
}

body = {
    'tokenizer': 'standard',
    'filter': ['lowercase', shingle_filter],
    'text': sample_text['content'],
    'explain': False
}

standard_tokens = analyze_client.analyze(body=body, format='text')
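
The _analyze response comes back as a dict with a 'tokens' list, one entry per produced term, so the shingled output above can be inspected like this (a minimal sketch; it assumes analyze_client is the indices client of the elasticsearch-py library, as the calls above suggest):

# each entry also carries start_offset, end_offset and position metadata
produced_terms = [t['token'] for t in standard_tokens['tokens']]
print(produced_terms)  # e.g. ['this', 'this is', 'is', 'is a', ...]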

Answer 1:


After struggling a bit, it seems predicate_token_filter was what I needed.

shingle_filter = {
    "type": "shingle",
    "token_separator": " "
}

# keep only tokens whose term equals "beautiful day"
predicate_token_filter_temp = {
    "type": "predicate_token_filter",
    "script": {
        "source": "String term = \"beautiful day\"; token.getTerm().toString().equals(term)"
    }
}

body = {
    'tokenizer': 'standard',
    'filter': ['lowercase', shingle_filter, predicate_token_filter_temp],
    'text': sample_text['content'],
    'explain': False
}

standard_tokens = analyze_client.analyze(body=body, format='text')

I'm not sure this is the best way to do it, but it gets the job done.
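
If the same behaviour is needed at index or search time rather than only through _analyze, the two filters can also be declared in the index settings as a custom analyzer. A minimal sketch, assuming the elasticsearch-py client (analyze_client standing in for es.indices) and hypothetical names my_index, my_shingle, keep_phrase and phrase_combiner:

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_shingle": {
                    "type": "shingle",
                    "token_separator": " "
                },
                "keep_phrase": {
                    "type": "predicate_token_filter",
                    "script": {
                        "source": "String term = \"beautiful day\"; token.getTerm().toString().equals(term)"
                    }
                }
            },
            "analyzer": {
                "phrase_combiner": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_shingle", "keep_phrase"]
                }
            }
        }
    }
}

# "my_index" is a hypothetical index name; analyze_client is assumed to be es.indices
analyze_client.create(index="my_index", body=settings)

A field mapped with phrase_combiner as its analyzer would then produce the same filtered token stream as the _analyze call above.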



Source: https://stackoverflow.com/questions/59354822/how-to-combine-certain-set-of-words-into-token-in-elasticsearch
