I am trying to index strings that contain hyphens but do not contain spaces, periods or any other punctuation. I do not want to split up the words based on hyphens, instead I wo
You're on the right path, however, you need to also add another analyzer that leverages the edge-ngram token filter in order to make the "starts with" contraint work. You can keep the ngram
for checking fields that "contain" a given word, but you need edge-ngram
to check that a field "starts with" some token.
PUT /sample
{
"settings": {
"index.number_of_shards": 5,
"index.number_of_replicas": 0,
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
},
"edgenGram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"ngram_index_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"nGram_filter"
]
},
"edge_ngram_index_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"edgenGram_filter"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"fields": {
"prefixes": {
"type": "string",
"analyzer": "edge_ngram_index_analyzer",
"search_analyzer": "standard"
},
"substrings": {
"type": "string",
"analyzer": "ngram_index_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
Then your query will become (i.e. search for all documents whose name
field contains play
or starts with magazine
)
POST /sample/test/_search
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{"match": { "name.substrings": "play" }},
{"match": { "name.prefixes": "magazine" }}
]
}
}
}
Note: don't use wildcard
for searching for substrings, as it will kill the performance of your cluster (more info here and here)