Elasticsearch: index a field with keyword tokenizer but without stopwords

Submitted by 拈花ヽ惹草 on 2019-12-11 01:42:49

Question


I am looking for a way to search company names with keyword tokenization but without stop words.

For ex : The indexed company name is "Hansel und Gretel Gmbh."

Here "und" and "Gmbh" are stop words for the company name.

If the search term is "Hansel Gretel", that document should be found. If the search term is "Hansel", no document should be found. And if the search term is "hansel gmbh", no document should be found either.

I have tried to combine the keyword tokenizer with a stopwords filter in a custom analyzer, but it didn't work (as expected, I guess).

I have also tried to use the common terms query, but "Hansel" started to hit (again, as expected).

Thanks in advance.


Answer 1:


There are two ways: a bad one and an ugly one. The first uses regular expressions to remove stop words and trim spaces. It has a lot of drawbacks:

  • you have to handle whitespace tokenization (regexp \s+) and special-symbol removal (e.g. ".", ",", ";") on your own
  • highlighting is not supported - the keyword tokenizer does not support it
  • case sensitivity is also a problem
  • normalizers (analyzers for keyword fields) are an experimental feature - poorly supported, with few features

Here is a step-by-step example:

curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "char_filter": ["stopword_char_filter", "trim_char_filter"],
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "stopword_char_filter": {
          "type": "pattern_replace",
          "pattern": "( ?und ?| ?gmbh ?)",
          "replacement": " "
        },
        "trim_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\s+)$",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "file": {
      "properties": {
        "name": {
          "type": "keyword",
          "normalizer": "custom_normalizer"
        }
      }
    }
  }
}'

Now we can check how our normalizer works (note that _analyze requests with a normalizer are supported only in ES 6.x and later):

curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "normalizer": "custom_normalizer",
  "text": "hansel und gretel gmbh"
}'
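
If the index was created as above, the response should contain a single normalized token, roughly like this (the exact offsets and token type may differ slightly across versions):

{
  "tokens": [
    {
      "token": "hansel gretel",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}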

Now we are ready to index our document:

curl -XPUT "http://localhost:9200/test/file/1" -H 'Content-Type: application/json' -d'
{
  "name": "hansel und gretel gmbh"
}'

And the last step is search:

curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match" : {
            "name" : {
                "query" : "hansel gretel"
            }
        }
    }
}'
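
As a quick sanity check, searching for "hansel" alone should return no hits: the match query runs the search term through the same normalizer, and the resulting single term "hansel" does not equal the stored value "hansel gretel".

curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match" : {
            "name" : {
                "query" : "hansel"
            }
        }
    }
}'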

Another approach is:

  • create a standard text analyzer with a stop words filter
  • use the _analyze API to filter out all stop words and special symbols
  • concatenate the tokens manually on the client side (a sketch follows the analysis example below)
  • send the concatenated term to ES as a keyword

Here is a step-by-step example:

curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_stopwords"]
        }
      }, "filter": {
        "custom_stopwords": {
          "type": "stop",
          "stopwords": ["und", "gmbh"]
        }
      }
    }
  },
  "mappings": {
    "file": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}' 

Now we are ready to analyze our text:

curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}'

with the following result:

{
  "tokens": [
    {
      "token": "hansel",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gretel",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

The last step is concatenating the tokens on the client side: hansel + gretel → "hansel gretel". The only drawback is that this analysis step has to be driven by custom code outside Elasticsearch.
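
As a minimal sketch of that client-side step (assuming the jq command-line tool is available; any JSON parser would work the same way), you can call the _analyze API, join the surviving tokens with a space, and use the result as the keyword search term:

# Run the raw name through custom_analyzer, then join the remaining
# tokens with a space to build the final keyword term.
TERM=$(curl -s -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}' | jq -r '[.tokens[].token] | join(" ")')

echo "$TERM"    # prints: hansel gretel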



Source: https://stackoverflow.com/questions/48097075/elasticsearch-index-a-field-with-keyword-tokenizer-but-without-stopwords
