How to fuzzy match email or telephone by Elasticsearch?

时光取名叫无心 2020-11-29 10:37

I want to do fuzzy matching on email addresses or telephone numbers with Elasticsearch. For example:

match all emails ending with @gmail.com

or

match all tele

1 answer
  • 2020-11-29 11:21

    An easy way to do this is to create custom analyzers that use an n-gram token filter for emails (see index_email_analyzer and search_email_analyzer below, plus email_url_analyzer for exact email matching) and an edge-ngram tokenizer for phones (see index_phone_analyzer and search_phone_analyzer below).

    The full index definition is available below.

    PUT myindex
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "email_url_analyzer": {
              "type": "custom",
              "tokenizer": "uax_url_email",
              "filter": [ "trim" ]
            },
            "index_phone_analyzer": {
              "type": "custom",
              "char_filter": [ "digit_only" ],
              "tokenizer": "digit_edge_ngram_tokenizer",
              "filter": [ "trim" ]
            },
            "search_phone_analyzer": {
              "type": "custom",
              "char_filter": [ "digit_only" ],
              "tokenizer": "keyword",
              "filter": [ "trim" ]
            },
            "index_email_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [ "lowercase", "name_ngram_filter", "trim" ]
            },
            "search_email_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [ "lowercase", "trim" ]
            }
          },
          "char_filter": {
            "digit_only": {
              "type": "pattern_replace",
              "pattern": "\\D+",
              "replacement": ""
            }
          },
          "tokenizer": {
            "digit_edge_ngram_tokenizer": {
              "type": "edgeNGram",
              "min_gram": "1",
              "max_gram": "15",
              "token_chars": [ "digit" ]
            }
          },
          "filter": {
            "name_ngram_filter": {
              "type": "ngram",
              "min_gram": "1",
              "max_gram": "20"
            }
          }
        }
      },
      "mappings": {
        "your_type": {
          "properties": {
            "email": {
              "type": "string",
              "analyzer": "index_email_analyzer",
              "search_analyzer": "search_email_analyzer"
            },
            "phone": {
              "type": "string",
              "analyzer": "index_phone_analyzer",
              "search_analyzer": "search_phone_analyzer"
            }
          }
        }
      }
    }
    

    Now, let's dissect it one bit after another.

    For the phone field, the idea is to index phone values with index_phone_analyzer, which uses an edge-ngram tokenizer in order to index all prefixes of the phone number. So if your phone number is 1362435647, the following tokens will be produced: 1, 13, 136, 1362, 13624, 136243, 1362435, 13624356, 136243564, 1362435647.
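
    The prefix generation can be approximated outside Elasticsearch to check which tokens a given number would yield. The following is a minimal Python sketch, not Elasticsearch code: phone_index_tokens is a hypothetical helper that mimics the digit_only char filter followed by an edge n-gram step with min_gram 1 and max_gram 15.

```python
import re

def phone_index_tokens(raw, min_gram=1, max_gram=15):
    """Approximate index_phone_analyzer: strip non-digits, then emit prefixes."""
    # Same pattern as the digit_only char filter in the index settings.
    digits = re.sub(r"\D+", "", raw)
    # Prefixes never exceed max_gram characters or the length of the input.
    longest = min(len(digits), max_gram)
    return [digits[:n] for n in range(min_gram, longest + 1)]

tokens = phone_index_tokens("136-243-5647")
print(tokens)
```

    A ten-digit number gives ten prefixes, from the first digit up to the full number, regardless of how the number was punctuated in the source document.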

    Then when searching we use another analyzer search_phone_analyzer which will simply take the input number (e.g. 136) and match it against the phone field using a simple match or term query:

    POST myindex/_search
    { 
        "query": {
            "term": 
                { "phone": "136" }
        }
    }
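
    Note that a term query does not analyze its input, whereas a match query applies search_phone_analyzer first, which matters when users type punctuation. A rough Python sketch of the search side, assuming the same digit-stripping pattern as the digit_only char filter (phone_search_token, indexed_prefixes and matches are hypothetical helpers, not Elasticsearch APIs):

```python
import re

def phone_search_token(user_input):
    """Approximate search_phone_analyzer: strip non-digits, keep a single token."""
    return re.sub(r"\D+", "", user_input).strip()

def indexed_prefixes(phone, max_gram=15):
    """Approximate the prefixes stored at index time for one phone value."""
    digits = re.sub(r"\D+", "", phone)
    return {digits[:n] for n in range(1, min(len(digits), max_gram) + 1)}

def matches(user_input, phone):
    """True when the normalized input equals one of the indexed prefixes."""
    token = phone_search_token(user_input)
    return token != "" and token in indexed_prefixes(phone)

print(matches("(136) 2", "1362435647"))  # True: "1362" is a prefix
print(matches("362", "1362435647"))      # False: not a prefix
```

    Only prefixes match in this setup; finding digits in the middle of a number would need a plain n-gram step instead of an edge n-gram one.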
    

    For the email field, we proceed in a similar way: we index the email values with index_email_analyzer, which uses an ngram token filter to produce all substrings of varying length (between 1 and 20 chars) of each token. Note that the standard tokenizer splits the address at the @ sign, so john@gmail.com is first split into john and gmail.com, which are then tokenized to j, jo, joh, john, o, oh, ..., g, gm, ..., gmail.com.
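
    The n-gram step can also be approximated outside Elasticsearch. This is a minimal Python sketch under one assumption: that the standard tokenizer first splits the address at the @ sign, so the local part and the domain are n-grammed separately. ngrams and email_index_tokens are hypothetical helpers, not Elasticsearch APIs.

```python
def ngrams(token, min_gram=1, max_gram=20):
    """Return every substring of token whose length is within the gram bounds."""
    out = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            out.append(token[start:start + size])
    return out

def email_index_tokens(email):
    """Approximate index_email_analyzer: split at '@', lowercase, then n-gram."""
    # Assumption: the tokenizer drops '@' and keeps 'gmail.com' as one token.
    parts = [p for p in email.lower().split("@") if p]
    tokens = set()
    for part in parts:
        tokens.update(ngrams(part))
    return tokens

tokens = email_index_tokens("John@gmail.com")
print("gmail.com" in tokens)   # True
print("joh" in tokens)         # True
print("@gmail.com" in tokens)  # False: '@' never reaches the n-gram filter
```

    Under that assumption, a search for the domain has to be analyzed the same way (so that @gmail.com becomes gmail.com) before it can hit an indexed token.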

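    Then when searching, we use another analyzer, search_email_analyzer, which lowercases the input without n-gramming it, so that it is matched as-is against the indexed tokens. Use a match query here: a term query skips the search analyzer, and the literal @gmail.com is not an indexed token because the standard tokenizer drops the @.

    POST myindex/_search
    { 
        "query": {
            "match": 
                { "email": "@gmail.com" }
        }
    }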
    

    The email_url_analyzer is not used in this example, but I've included it in case you need to match on the exact email value.
