Elasticsearch custom analyzer with ngram and without word delimiter on hyphens

一整个雨季 2021-01-21 13:23

I am trying to index strings that contain hyphens but do not contain spaces, periods or any other punctuation. I do not want to split up the words based on hyphens, instead I wo

1 answer
  •  迷失自我
    2021-01-21 14:25

    You're on the right path. However, you also need to add another analyzer that uses the edge-ngram token filter to make the "starts with" constraint work. You can keep the ngram analyzer for checking whether a field "contains" a given word, but you need edge-ngram to check that a field "starts with" some token.

    PUT /sample
    {
      "settings": {
        "index.number_of_shards": 5,
        "index.number_of_replicas": 0,
        "analysis": {
          "filter": {
            "nGram_filter": {
              "type": "nGram",
              "min_gram": 2,
              "max_gram": 20,
              "token_chars": [
                "letter",
                "digit"
              ]
            },
            "edgenGram_filter": {
              "type": "edgeNGram",
              "min_gram": 2,
              "max_gram": 20
            }
          },
          "analyzer": {
            "ngram_index_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "lowercase",
                "nGram_filter"
              ]
            },
            "edge_ngram_index_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "lowercase",
                "edgenGram_filter"
              ]
            }
          }
        }
      },
      "mappings": {
        "test": {
          "properties": {
            "name": {
              "type": "string",
              "fields": {
                "prefixes": {
                  "type": "string",
                  "analyzer": "edge_ngram_index_analyzer",
                  "search_analyzer": "standard"
                },
                "substrings": {
                  "type": "string",
                  "analyzer": "ngram_index_analyzer",
                  "search_analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
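
    To make the difference between the two analyzers concrete, here is a minimal Python sketch of the terms each one would roughly emit for a single value. It is not Elasticsearch code: the function names and the sample value "My-Play" are made up for illustration, and it only assumes the settings shown above (keyword tokenizer, lowercase filter, min_gram 2, max_gram 20).

```python
def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of length min_gram..max_gram (mimics the nGram filter)."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            grams.append(token[start:start + size])
    return grams


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Only the substrings anchored at the start (mimics the edgeNGram filter)."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


def analyze(value, gram_fn):
    # keyword tokenizer: the whole value stays one token, hyphens included.
    # lowercase filter: applied before the grams are generated.
    return gram_fn(value.lower())


substrings = analyze("My-Play", ngrams)       # terms for name.substrings
prefixes = analyze("My-Play", edge_ngrams)    # terms for name.prefixes

print(prefixes)
print("play" in substrings, "play" in prefixes)
```

    Because the keyword tokenizer never splits on the hyphen, grams such as "my-p" cross it freely. "play" shows up among the substrings but not among the prefixes, which is why the two sub-fields answer different questions.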
    

    Then your query becomes the following (i.e. search for all documents whose name field contains "play" or starts with "magazine"):

    POST /sample/test/_search
    {
        "query": {
            "bool": {
                "minimum_should_match": 1,
                "should": [
                    {"match": { "name.substrings": "play" }},
                    {"match": { "name.prefixes": "magazine" }}
                ]
            }
        }
    }
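
    The matching logic of that query can be sketched in plain Python. This is a simplified model, not how Elasticsearch works internally: it ignores scoring, assumes each search term is a single lowercase word (so the standard search analyzer leaves it as one term), and uses hypothetical document names.

```python
def index_terms(value, prefixes_only, min_gram=2, max_gram=20):
    """Terms stored for one value: edge n-grams only, or all n-grams."""
    token = value.lower()  # keyword tokenizer + lowercase filter
    starts = [0] if prefixes_only else range(len(token))
    return {
        token[s:s + n]
        for s in starts
        for n in range(min_gram, max_gram + 1)
        if s + n <= len(token)
    }


def search(names, contains, starts_with):
    """Keep names where at least one of the two clauses matches."""
    hits = []
    for name in names:
        in_substrings = contains.lower() in index_terms(name, prefixes_only=False)
        in_prefixes = starts_with.lower() in index_terms(name, prefixes_only=True)
        # Equivalent of bool/should with minimum_should_match = 1.
        if in_substrings or in_prefixes:
            hits.append(name)
    return hits


# Hypothetical sample names, chosen only to exercise both clauses.
names = ["my-playground", "magazine-2021", "sports-magazine"]
hits = search(names, contains="play", starts_with="magazine")
print(hits)
```

    Here "sports-magazine" is not returned: it contains "magazine" but does not start with it, and the "contains" clause is only looking for "play".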
    

    Note: don't use wildcard queries to search for substrings, as they will kill the performance of your cluster.
