Elasticsearch custom analyzer with ngram and without word delimiter on hyphens

I am trying to index strings that contain hyphens but do not contain spaces, periods or any other punctuation. I do not want to split up the words based on hyphens, instead I wo

1 Answer

迷失自我

2021-01-21 14:25

You're on the right path. However, you also need a second analyzer that uses the edge-ngram token filter to make the "starts with" constraint work. Keep the ngram analyzer for checking that a field "contains" a given word, and use the edge-ngram analyzer to check that a field "starts with" some token.

PUT /sample
{
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20
        },
        "edgenGram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ngram_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "nGram_filter"
          ]
        },
        "edge_ngram_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "edgenGram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "prefixes": {
              "type": "string",
              "analyzer": "edge_ngram_index_analyzer",
              "search_analyzer": "standard"
            },
            "substrings": {
              "type": "string",
              "analyzer": "ngram_index_analyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
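
To make the difference between the two sub-fields concrete, here is a rough Python emulation of what each analyzer would index for one value. It is only a sketch: the function names and the sample value "Magazine-Play" are made up for illustration, and it models the keyword tokenizer, the lowercase filter and the gram filters in the simplest possible way rather than reproducing Elasticsearch's actual implementation.

```python
def ngrams(text, min_gram=2, max_gram=20):
    """Every substring of length min_gram..max_gram (nGram-style)."""
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


def edge_ngrams(text, min_gram=2, max_gram=20):
    """Only substrings anchored at the start of the text (edgeNGram-style)."""
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}


def index_tokens(value, gram_fn):
    # keyword tokenizer: the whole value is a single token (hyphens kept),
    # then lowercase, then the gram filter.
    return gram_fn(value.lower())


name = "Magazine-Play"  # hypothetical sample value
substrings = index_tokens(name, ngrams)      # like name.substrings
prefixes = index_tokens(name, edge_ngrams)   # like name.prefixes

print("play" in substrings)     # found anywhere in the value
print("play" in prefixes)       # not at the start, so not a prefix
print("magazine" in prefixes)   # leading part of the value
print("e-pl" in substrings)     # grams span the hyphen, nothing is split
```

Because the keyword tokenizer never splits the value, the hyphen is just another character inside the grams, which is the behaviour the question asks for.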

Your query then becomes the following, which searches for all documents whose name field contains "play" or starts with "magazine":

POST /sample/test/_search
{
    "query": {
        "bool": {
            "minimum_should_match": 1,
            "should": [
                {"match": { "name.substrings": "play" }},
                {"match": { "name.prefixes": "magazine" }}
            ]
        }
    }
}
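
The matching logic of that bool query can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions: `search_terms` is a crude stand-in for the standard search analyzer (lowercase, split on anything that is not an ASCII letter or digit), `matches` and the sample names are hypothetical, and scoring is ignored entirely.

```python
import re


def ngrams(text, min_gram=2, max_gram=20):
    """Every substring of length min_gram..max_gram."""
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}


def edge_ngrams(text, min_gram=2, max_gram=20):
    """Substrings anchored at the start of the text."""
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}


def search_terms(query):
    # Crude approximation of the standard analyzer used at search time.
    return [t for t in re.split(r"[^0-9a-z]+", query.lower()) if t]


def matches(name, contains, starts_with):
    """Emulates the two `should` clauses with minimum_should_match = 1."""
    value = name.lower()
    # A match query defaults to OR: one matching term is enough per clause.
    sub_hit = any(t in ngrams(value) for t in search_terms(contains))
    pre_hit = any(t in edge_ngrams(value) for t in search_terms(starts_with))
    return sub_hit or pre_hit


for name in ["my-playlist", "magazines-weekly", "the-magazine"]:
    print(name, matches(name, contains="play", starts_with="magazine"))
```

One consequence of the gram settings is visible in this sketch: a search term shorter than min_gram (2) or longer than max_gram (20) has no indexed gram to match against.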

Note: don't use wildcard queries to search for substrings, as they can severely degrade the performance of your cluster.
