I want to do fuzzy matching on emails or telephone numbers with Elasticsearch. For example:
match all emails ending with @gmail.com
or
match all tele
An easy way to do this is to create custom analyzers that use the n-gram token filter for emails (see index_email_analyzer and search_email_analyzer below, plus email_url_analyzer for exact email matching) and the edge-ngram tokenizer for phones (see index_phone_analyzer and search_phone_analyzer below).
The full index definition is available below.
PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_url_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [ "trim" ]
        },
        "index_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "digit_edge_ngram_tokenizer",
          "filter": [ "trim" ]
        },
        "search_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        },
        "index_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "name_ngram_filter", "trim" ]
        },
        "search_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trim" ]
        }
      },
      "char_filter": {
        "digit_only": {
          "type": "pattern_replace",
          "pattern": "\\D+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "digit_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "15",
          "token_chars": [ "digit" ]
        }
      },
      "filter": {
        "name_ngram_filter": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "20"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "index_email_analyzer",
          "search_analyzer": "search_email_analyzer"
        },
        "phone": {
          "type": "string",
          "analyzer": "index_phone_analyzer",
          "search_analyzer": "search_phone_analyzer"
        }
      }
    }
  }
}
Now, let's dissect it one bit after another.
For the phone field, the idea is to index phone values with index_phone_analyzer, which uses an edge-ngram tokenizer in order to index all prefixes of the phone number. So if your phone number is 1362435647, the following tokens will be produced: 1, 13, 136, 1362, 13624, 136243, 1362435, 13624356, 136243564, 1362435647.
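To make the prefix expansion concrete, here is a minimal Python sketch that imitates what index_phone_analyzer does at index time. It is only an approximation written for illustration: the function name index_phone_tokens is hypothetical, and the code does not call Elasticsearch at all.

```python
import re


def index_phone_tokens(raw, min_gram=1, max_gram=15):
    """Approximate index_phone_analyzer: drop non-digits, then emit prefixes."""
    # Mirrors the digit_only char filter (pattern \D+ replaced by nothing).
    digits = re.sub(r"\D+", "", raw)
    # Mirrors the edge n-gram tokenizer: one token per prefix length.
    longest = min(len(digits), max_gram)
    return [digits[:size] for size in range(min_gram, longest + 1)]


if __name__ == "__main__":
    for token in index_phone_tokens("1362435647"):
        print(token)
```

Because non-digit characters are removed before tokenizing, an input such as (136) 243-5647 should yield the same ten prefixes in this sketch.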
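Then when searching, we use another analyzer, search_phone_analyzer, which simply strips the non-digit characters from the input number (e.g. 136) and keeps it as a single token. That token is matched against the phone field using a simple match or term query (note that a term query is not analyzed, so its input must already contain digits only):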
POST myindex/_search
{
  "query": {
    "term": { "phone": "136" }
  }
}
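The search side can be sketched the same way. The Python below is a rough model, not Elasticsearch code: the helper names (digits_only, indexed_prefixes, search_token, phone_matches) are made up for this illustration, and it assumes the query goes through the search analyzer, as a match query would.

```python
import re


def digits_only(raw):
    """Mirror the digit_only char filter."""
    return re.sub(r"\D+", "", raw)


def indexed_prefixes(stored_phone, max_gram=15):
    """Prefixes that index_phone_analyzer would store for one phone value."""
    digits = digits_only(stored_phone)
    return {digits[:size] for size in range(1, min(len(digits), max_gram) + 1)}


def search_token(user_input):
    """search_phone_analyzer: strip non-digits, keep the rest as one token."""
    return digits_only(user_input)


def phone_matches(stored_phone, user_input):
    """True when the single search token equals one of the indexed prefixes."""
    return search_token(user_input) in indexed_prefixes(stored_phone)


if __name__ == "__main__":
    print(phone_matches("1362435647", "136"))
    print(phone_matches("1362435647", "362"))
```

Only prefixes match in this model, which is the point of using edge n-grams for phones: 136 is found, while 362 (a fragment from the middle of the number) is not.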
For the email field, we proceed in a similar way, in that we index the email values with the index_email_analyzer, which uses an ngram token filter to produce all possible tokens of varying length (between 1 and 20 chars) that can be taken from each token emitted by the standard tokenizer. For instance, john@gmail.com is first split at the @ into john and gmail.com, and the filter then produces j, jo, joh, john, ..., g, gm, ..., gmail.com.
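As a rough illustration of the email side, the following Python sketch generates n-grams the way the name_ngram_filter settings describe. The tokenizer step is an assumption: it only splits on @ and whitespace, which is a simplified stand-in for the standard tokenizer that should behave the same for simple addresses such as john@gmail.com. All function names here are hypothetical.

```python
import re


def split_words(raw):
    """Simplified stand-in for the standard tokenizer plus lowercase filter."""
    return [word for word in re.split(r"[@\s]+", raw.lower()) if word]


def ngrams(token, min_gram=1, max_gram=20):
    """All substrings of token whose length is between min_gram and max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            grams.append(token[start:start + size])
    return grams


def index_email_tokens(raw):
    """Approximate index_email_analyzer: split into words, then n-gram each."""
    tokens = []
    for word in split_words(raw):
        tokens.extend(ngrams(word))
    return tokens


def search_email_tokens(raw):
    """Approximate search_email_analyzer: split into words, no n-grams."""
    return split_words(raw)


if __name__ == "__main__":
    indexed = set(index_email_tokens("john@gmail.com"))
    query = search_email_tokens("@gmail.com")
    print(query, all(token in indexed for token in query))
```

Under this approximation the @ is a split point, so a search for @gmail.com is reduced to gmail.com before it is compared with the indexed n-grams, and the full address never appears as a single token.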
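Then when searching, we'll use another analyzer called search_email_analyzer, which will take the input, lowercase it, and try to match it against the indexed tokens. Use a match query here so that the search analyzer is actually applied; a term query is not analyzed and would look for the literal token @gmail.com, which is never indexed because the standard tokenizer drops the @.
POST myindex/_search
{
  "query": {
    "match": { "email": "@gmail.com" }
  }
}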
The email_url_analyzer analyzer is not used in this example, but I've included it just in case you need to match on the exact email value.