Question
Background: I use MySQL with millions of rows, each row having twenty columns. We have some complex searches, and some columns need fuzzy matching, such as username like '%aaa%'. MySQL can't use an index for that unless the leading % is removed, but we need fuzzy matching to search the way Stack Overflow's search does. I also checked the MySQL full-text index, but it doesn't support complex searches within one SQL statement when combined with other indexes.
My solution: add Elasticsearch as our search engine, insert data into both MySQL and ES, and search only in Elasticsearch.
I checked Elasticsearch fuzzy search. wildcard works, but many people advise against a leading *, because it makes the search very slow.
For example, with username: 'John_Snow', wildcard works but may be very slow:
GET /user/_search
{
  "query": {
    "wildcard": {
      "username": "*hn*"
    }
  }
}
match_phrase doesn't work; it seems to match only whole tokens produced by the tokenizer, such as the phrase 'John Snow':
{
  "query": {
    "match_phrase": {
      "username": "hn"
    }
  }
}
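This is expected: the default standard analyzer emits only whole-word tokens, so a bare substring like 'hn' never exists in the index. This can be verified with the _analyze API (the request below is for illustration and is not part of the original post):
POST /_analyze
{
  "analyzer": "standard",
  "text": "John Snow"
}
It returns only the tokens john and snow.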
My question: is there a better solution for complex queries that contain fuzzy matches like '%no%' or '%hn_Sn%'?
Answer 1:
You can use the ngram tokenizer, which first breaks text down into words whenever it encounters one of a list of specified characters, then emits N-grams of each word of the specified length.
Adding a working example with index data, mapping, search query, and results.
Index Mapping:
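Assuming the index is named test (the _index value in the search result below), the mapping is the body of the index-creation request. Note that max_ngram_diff must be raised because min_gram (2) and max_gram (10) differ by more than the default limit of 1:
PUT /test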
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
Analyze API
POST /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "John_Snow"
}
The tokens are:
{
  "tokens": [
    { "token": "Jo",   "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "Joh",  "start_offset": 0, "end_offset": 3, "type": "word", "position": 1 },
    { "token": "John", "start_offset": 0, "end_offset": 4, "type": "word", "position": 2 },
    { "token": "oh",   "start_offset": 1, "end_offset": 3, "type": "word", "position": 3 },
    { "token": "ohn",  "start_offset": 1, "end_offset": 4, "type": "word", "position": 4 },
    { "token": "hn",   "start_offset": 2, "end_offset": 4, "type": "word", "position": 5 },
    { "token": "Sn",   "start_offset": 5, "end_offset": 7, "type": "word", "position": 6 },
    { "token": "Sno",  "start_offset": 5, "end_offset": 8, "type": "word", "position": 7 },
    { "token": "Snow", "start_offset": 5, "end_offset": 9, "type": "word", "position": 8 },
    { "token": "no",   "start_offset": 6, "end_offset": 8, "type": "word", "position": 9 },
    { "token": "now",  "start_offset": 6, "end_offset": 9, "type": "word", "position": 10 },
    { "token": "ow",   "start_offset": 7, "end_offset": 9, "type": "word", "position": 11 }
  ]
}
Index Data:
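Assuming document id 1 (the _id in the search result below), the document is indexed with:
PUT /test/_doc/1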
{
  "title": "John_Snow"
}
Search Query:
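Assuming the same test index, the query is sent as:
GET /test/_search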
{
  "query": {
    "match": {
      "title": "hn"
    }
  }
}
Search Result:
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "John_Snow"
}
}
]
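The other substring from the question, 'no', is also among the emitted n-grams (position 9 in the _analyze output above), so the equivalent of '%no%' should match the same document:
GET /test/_search
{
  "query": {
    "match": {
      "title": "no"
    }
  }
}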
Refer to this blog if you want to do an autocomplete search.
Another search query:
{
  "query": {
    "match": {
      "title": "ohr"
    }
  }
}
The above search query returns no results, because "ohr" is not a contiguous substring of "John_Snow", so no such n-gram was ever indexed.
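One caveat for patterns that span the underscore, such as '%hn_Sn%' from the question: since token_chars only keeps letters and digits, the tokenizer splits "John_Snow" at the underscore and never emits a gram containing it, as the _analyze output above shows. A sketch of a tokenizer definition that should keep the underscore, using the ngram tokenizer's custom character class (available in recent Elasticsearch versions):
"tokenizer": {
  "my_tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 10,
    "token_chars": [ "letter", "digit", "custom" ],
    "custom_token_chars": "_"
  }
}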
Source: https://stackoverflow.com/questions/63912422/what-is-the-best-practice-of-fuzzy-search-like-aaa-in-mysql-in-elasticsear