Is it possible in Elasticsearch to form a query that preserves the ordering of the terms?
A simple example would be having these documents indexed using standar
This is exactly what a match_phrase query (see here) does.
It checks the positions of the terms, in addition to their presence.
For example, these documents:
POST test/values
{
  "test": "Hello World"
}
POST test/values
{
  "test": "Hello nice World"
}
POST test/values
{
  "test": "World, I don't say hello"
}
will all be found with the basic match query:
POST test/_search
{
  "query": {
    "match": {
      "test": "Hello World"
    }
  }
}
But with a match_phrase query, only the first document is returned:
POST test/_search
{
  "query": {
    "match_phrase": {
      "test": "Hello World"
    }
  }
}
{
  ...
  "hits": {
    "total": 1,
    "max_score": 2.3953633,
    "hits": [
      {
        "_index": "test",
        "_type": "values",
        "_id": "qFZAKYOTQh2AuqplLQdHcA",
        "_score": 2.3953633,
        "_source": {
          "test": "Hello World"
        }
      }
    ]
  }
}
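The difference between the two queries can be sketched in a few lines of Python. This is only a simplified model, not how Elasticsearch is implemented: it assumes a tokenizer that lowercases and splits on non-word characters (roughly what the standard analyzer does for these sentences), and the helper names `tokenize`, `match_any` and `match_phrase` are made up for the illustration.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def match_any(doc, query):
    """Model of a basic match query: at least one query term is present."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))

def match_phrase(doc, query):
    """Model of match_phrase: the query terms occur at consecutive positions, in order."""
    doc_terms, query_terms = tokenize(doc), tokenize(query)
    n = len(query_terms)
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))

docs = ["Hello World", "Hello nice World", "World, I don't say hello"]
print([match_any(d, "Hello World") for d in docs])     # [True, True, True]
print([match_phrase(d, "Hello World") for d in docs])  # [True, False, False]
```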
In your case, you want to allow some distance between the terms. This can be achieved with the slop parameter, which indicates how far apart the terms are allowed to be:
POST test/_search
{
  "query": {
    "match": {
      "test": {
        "query": "Hello world",
        "slop": 1,
        "type": "phrase"
      }
    }
  }
}
With this last request, the second document is found too:
{
  ...
  "hits": {
    "total": 2,
    "max_score": 0.38356602,
    "hits": [
      {
        "_index": "test",
        "_type": "values",
        "_id": "7mhBJgm5QaO2_aXOrTB_BA",
        "_score": 0.38356602,
        "_source": {
          "test": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "values",
        "_id": "VKdUJSZFQNCFrxKk_hWz4A",
        "_score": 0.2169777,
        "_source": {
          "test": "Hello nice World"
        }
      }
    ]
  }
}
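To get a feel for what a given slop value allows, here is a rough Python model of the idea. It is a sketch under assumptions, not the actual scoring code: each query term is treated as "shifted" by (its position in the document minus its position in the query), and the slop needed is taken to be the spread between the largest and smallest shift. The tokenizer and the function name `min_slop` are hypothetical, and repeated query terms are not handled specially.

```python
import re
from itertools import product

def tokenize(text):
    """Lowercase and split into word tokens (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def min_slop(doc, query):
    """Smallest slop at which the phrase matches in this model, or None if a term is missing."""
    doc_terms, query_terms = tokenize(doc), tokenize(query)
    # For every query term, list the possible shifts: doc position - query position.
    shifts = [
        [pos - q_idx for pos, term in enumerate(doc_terms) if term == q_term]
        for q_idx, q_term in enumerate(query_terms)
    ]
    if not shifts or any(not s for s in shifts):
        return None
    # Try every combination of occurrences and keep the tightest one.
    return min(max(combo) - min(combo) for combo in product(*shifts))

print(min_slop("Hello World", "Hello world"))               # 0
print(min_slop("Hello nice World", "Hello world"))          # 1
print(min_slop("World, I don't say hello", "Hello world"))  # 5
```

Under this model the second document needs a slop of 1, which agrees with the result above, while the third document would only show up with a much larger slop.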
You can find a whole chapter about this use case in Elasticsearch: The Definitive Guide.
You could use a span_near query; it has an in_order parameter.
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "field": "you"
          }
        },
        {
          "span_term": {
            "field": "search"
          }
        }
      ],
      "slop": 2,
      "in_order": true
    }
  }
}
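As a rough illustration of what in_order combined with slop means for two terms, here is a small Python sketch. It assumes that slop counts the tokens allowed between the two terms and that the text is already tokenized; the function name `span_near_in_order` is invented for the example and is not part of any client library.

```python
def span_near_in_order(doc_terms, first, second, slop):
    """True if `first` occurs before `second` with at most `slop` other tokens in between."""
    first_positions = [i for i, t in enumerate(doc_terms) if t == first]
    second_positions = [i for i, t in enumerate(doc_terms) if t == second]
    # q - p - 1 is the number of tokens strictly between the two occurrences;
    # requiring it to be >= 0 is what enforces the order.
    return any(
        0 <= q - p - 1 <= slop
        for p in first_positions
        for q in second_positions
    )

print(span_near_in_order("you can search here".split(), "you", "search", 2))  # True
print(span_near_in_order("search for you".split(), "you", "search", 2))       # False
print(span_near_in_order("you a b c search".split(), "you", "search", 2))     # False
```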
Phrase matching doesn't ensure order ;-). If you specify a large enough slop (2, for example), "hello world" will match "world hello". This is not necessarily a bad thing, because searches are usually more relevant when two terms are close to each other, whatever their order. And I don't think the authors of this feature had in mind matching words that are 1000 positions apart.
There is a solution I could find that keeps the order, though it is not simple: using scripts. Here's one example:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "hello world" }
{ "index": { "_id": 2 }}
{ "title": "world hello" }
{ "index": { "_id": 3 }}
{ "title": "hello term1 term2 term3 term4 world" }
POST my_index/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": {
            "query": "hello world",
            "slop": 5,
            "type": "phrase"
          }
        }
      },
      "filter": {
        "script": {
          "script": "term1Pos=0;term2Pos=0;term1Info = _index['title'].get('hello',_POSITIONS);term2Info = _index['title'].get('world',_POSITIONS); for(pos in term1Info){term1Pos=pos.position;}; for(pos in term2Info){term2Pos=pos.position;}; return term1Pos<term2Pos;",
          "params": {}
        }
      }
    }
  }
}
To make the script itself more readable, here it is again with indentation:
term1Pos = 0;
term2Pos = 0;
term1Info = _index['title'].get('hello',_POSITIONS);
term2Info = _index['title'].get('world',_POSITIONS);
for(pos in term1Info) {
    term1Pos = pos.position;
};
for(pos in term2Info) {
    term2Pos = pos.position;
};
return term1Pos < term2Pos;
The query above searches for "hello world" with a slop of 5, which matches all three of the documents above. The scripted filter then ensures that the position of the word "hello" in the document is lower than the position of the word "world". This way, no matter how large a slop we set in the query, the position check enforces the order.
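To see what the filter does without running a cluster, the same comparison can be mimicked in plain Python. This is a sketch only: it assumes whitespace tokenization and lowercasing, and the helpers `last_position` and `in_order` are hypothetical names that mirror the script's logic (each loop keeps overwriting the variable, so the last position of each term is what gets compared).

```python
def last_position(tokens, term):
    """Last position of `term` in `tokens`, or 0 if absent (same default as the script)."""
    position = 0
    for i, token in enumerate(tokens):
        if token == term:
            position = i
    return position

def in_order(title, first="hello", second="world"):
    """Mirror of the script filter: last `first` position must precede last `second` position."""
    tokens = title.lower().split()
    return last_position(tokens, first) < last_position(tokens, second)

titles = ["hello world", "world hello", "hello term1 term2 term3 term4 world"]
print([t for t in titles if in_order(t)])
# ['hello world', 'hello term1 term2 term3 term4 world']
```

Because only the last occurrence of each term is compared, a title like "hello world hello" would be filtered out in this model, which may or may not be what you want.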
This is the section in the documentation that sheds some light on the things used in the script above.