Question
Say I want to filter documents by some field, keeping only those within the 10th to 20th percentile. I'm wondering whether this is possible with a simple query, something like {"fieldName":{"percentile": [0.1, 0.2]}}.
Say I have these documents:
[{"a":1,"b":101},{"a":2,"b":102},{"a":3,"b":103}, ..., {"a":100,"b":200}]
I need to keep the first tenth of them by a (in ascending order), which would be a from 1 to 10. Then I need to sort those results by b in descending order and take a paginated result (for example page 2, with 10 items per page).
One solution I have in mind would be:
1. Get the total count of the documents.
2. Sort the documents by a and take the corresponding _id values, with a limit of 0.1 * total_count.
3. Write the final query, something like id in (...) order by b.
But the shortcomings are pretty obvious too:
- It seems inefficient if we're talking about subsecond latency.
- The second query might not work if the first query returns too many _id values (ES only allows 1000 by default. I can change the config of course, but there's always a limit).
Answer 1:
I doubt that there is a way to do this in one query if the exact values of a are not known beforehand, although I think one pretty efficient approach is feasible.
I would suggest running a percentiles aggregation as the first query and a range query as the second.
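As a rough sketch of that two-step flow, the Python below only builds the two request bodies and leaves the HTTP call out. The helper names (percentile_bounds_body, range_page_body), the aggregation name a_bounds, and the page and page_size parameters are assumptions made for illustration. The from and size keys are the standard search pagination parameters, added here because the question asks for pages.

```python
def percentile_bounds_body(field, low_pct, high_pct):
    """Body for the first request: ask only for the two percentile bounds."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "a_bounds": {  # hypothetical aggregation name
                "percentiles": {"field": field, "percents": [low_pct, high_pct]}
            }
        },
    }


def range_page_body(agg_response, filter_field, low_pct, high_pct,
                    sort_field, page=1, page_size=10):
    """Body for the second request, built from the first response."""
    values = agg_response["aggregations"]["a_bounds"]["values"]
    # Percentile keys in the response are formatted like "30.0" and "60.0".
    low = values[str(float(low_pct))]
    high = values[str(float(high_pct))]
    return {
        "query": {"range": {filter_field: {"gte": low, "lte": high}}},
        "sort": {sort_field: "desc"},
        "from": (page - 1) * page_size,  # offset of the first hit on this page
        "size": page_size,
    }
```

Because the bounds come from a separate request, documents indexed between the two requests can shift the percentiles slightly, so the second query filters on bounds that may already be a little stale.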
In my sample index I have only 14 documents, so for the sake of explanation I will find the documents that fall between the 30th and 60th percentiles of field a and sort them by field b in descending order (to make sure the sort worked).
Here are the docs I inserted:
{"a":1,"b":101}
{"a":5,"b":105}
{"a":10,"b":110}
{"a":2,"b":102}
{"a":6,"b":106}
{"a":7,"b":107}
{"a":9,"b":109}
{"a":4,"b":104}
{"a":8,"b":108}
{"a":12,"b":256}
{"a":13,"b":230}
{"a":14,"b":215}
{"a":3,"b":103}
{"a":11,"b":205}
Let's find the bounds of field a at the 30th and 60th percentiles:
POST my_percent/doc/_search
{
  "size": 0,
  "aggs": {
    "percentiles": {
      "percentiles": {
        "field": "a",
        "percents": [30, 60, 90]
      }
    }
  }
}
With my sample index it looks like this:
{
  ...
  "hits": {
    "total": 14,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "percentiles": {
      "values": {
        "30.0": 4.9,
        "60.0": 8.8,
        "90.0": 12.700000000000001
      }
    }
  }
}
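For this small sample, the returned values coincide with plain linear interpolation between neighbouring sorted values. The sketch below reproduces them in pure Python as a sanity check. It is an observation about these 14 documents, not a description of how Elasticsearch computes percentiles internally, and interpolated_percentile is a hypothetical helper.

```python
def interpolated_percentile(values, pct):
    """Percentile by linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = pct / 100 * (len(ordered) - 1)  # fractional index into ordered
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Values of field a from the 14 sample documents, in insertion order.
a_values = [1, 5, 10, 2, 6, 7, 9, 4, 8, 12, 13, 14, 3, 11]

for pct in (30, 60, 90):
    # Rounded to hide floating-point noise such as 12.700000000000001.
    print(pct, round(interpolated_percentile(a_values, pct), 6))
```

Rounded to six decimals this gives 4.9, 8.8 and 12.7, the same bounds as in the response above.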
Now we can use the boundaries in the range query:
POST my_percent/doc/_search
{
  "query": {
    "range": {
      "a": {
        "gte": 4.9,
        "lte": 8.8
      }
    }
  },
  "sort": {
    "b": "desc"
  }
}
And the result is:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": null,
    "hits": [
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vkFvYGMB_zM1P5OLcYkS",
        "_score": null,
        "_source": {
          "a": 8,
          "b": 108
        },
        "sort": [
          108
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vUFvYGMB_zM1P5OLWYkM",
        "_score": null,
        "_source": {
          "a": 7,
          "b": 107
        },
        "sort": [
          107
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "vEFvYGMB_zM1P5OLRok1",
        "_score": null,
        "_source": {
          "a": 6,
          "b": 106
        },
        "sort": [
          106
        ]
      },
      {
        "_index": "my_percent",
        "_type": "doc",
        "_id": "u0FvYGMB_zM1P5OLJImy",
        "_score": null,
        "_source": {
          "a": 5,
          "b": 105
        },
        "sort": [
          105
        ]
      }
    ]
  }
}
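To double-check those hits, the same filter and sort can be replayed locally on the 14 sample documents. This is only a plain-Python sketch for verification, not something Elasticsearch runs. The function filter_sort_page and its page and page_size parameters are hypothetical, added to show where the pagination from the question would fit.

```python
# The 14 sample documents, in the order they were inserted.
docs = [
    {"a": 1, "b": 101}, {"a": 5, "b": 105}, {"a": 10, "b": 110},
    {"a": 2, "b": 102}, {"a": 6, "b": 106}, {"a": 7, "b": 107},
    {"a": 9, "b": 109}, {"a": 4, "b": 104}, {"a": 8, "b": 108},
    {"a": 12, "b": 256}, {"a": 13, "b": 230}, {"a": 14, "b": 215},
    {"a": 3, "b": 103}, {"a": 11, "b": 205},
]


def filter_sort_page(docs, low, high, page=1, page_size=10):
    """Keep docs with low <= a <= high, sort by b descending, return one page."""
    kept = [d for d in docs if low <= d["a"] <= high]
    kept.sort(key=lambda d: d["b"], reverse=True)
    start = (page - 1) * page_size
    return kept[start:start + page_size]


result = filter_sort_page(docs, 4.9, 8.8)
print([d["a"] for d in result])  # [8, 7, 6, 5]
```

The four documents and their order match the hits returned by the range query.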
Note that the results of the percentiles aggregation are approximate.
In general, this looks like a task better solved by pandas or a Spark job.
Hope that helps!
Source: https://stackoverflow.com/questions/50166949/elasticsearch-filter-by-percentile