Problem:
If I search for "iphone" I get 400 product results, and the product category aggregation I have returns the top 3 categories in the result.
You are looking for the sampler aggregation. I have a similar answer at "Aggregation on top n results".
{
"aggs": {
"bestDocs": {
"sampler": {
"shard_size":100
},
"aggs": {
"product_categories": {
"terms": {
"field": "product_category",
"size": 3
}
}
}
}
}
}
It takes the 100 top-scoring documents on each shard (shard_size is applied per shard) and then runs the terms aggregation on that sample only.
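To make the effect concrete, here is a small Python sketch of the idea on a single shard: keep only the best-scoring documents, then count categories among them. The hit list, the scores, and the helper name `top_categories` are made up for illustration; this is not how Elasticsearch implements the sampler internally.

```python
from collections import Counter

def top_categories(hits, sample_size, size):
    """Count categories among the `sample_size` best-scoring hits only."""
    # Sort by score descending and keep the sample (like shard_size on one shard).
    sample = sorted(hits, key=lambda h: h["score"], reverse=True)[:sample_size]
    counts = Counter(h["product_category"] for h in sample)
    # most_common returns (category, doc_count) pairs, highest count first.
    return counts.most_common(size)

# Hypothetical hits: phones score high, cases score low but are numerous.
hits = (
    [{"score": 9.0, "product_category": "phones"}] * 3
    + [{"score": 1.0, "product_category": "cases"}] * 10
)

print(top_categories(hits, sample_size=4, size=3))          # sampled
print(top_categories(hits, sample_size=len(hits), size=3))  # all hits
```

With the sample limited to 4 documents, "phones" comes first; over all hits, the low-scoring but numerous "cases" would win.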
You could try using a filter aggregation based on a limit filter, and nest your terms aggregation in it.
Be aware that the limit is applied at shard level (see the documentation).
However, this should do the job for your case, with a query like:
{
"aggs": {
"limit_results": {
"filter": {
"limit": {
"value": 100
}
},
"aggs": {
"product_categories": {
"terms": {
"field": "product_category",
"size": 10
}
}
}
}
}
}
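Because the limit applies per shard, the number of documents that reach the terms aggregation can be as high as the number of shards times the limit value. A rough Python sketch of that arithmetic, with made-up shard contents and a hypothetical helper name:

```python
def docs_after_limit(shard_doc_counts, limit):
    """Number of documents that pass a per-shard limit, summed over all shards."""
    return sum(min(count, limit) for count in shard_doc_counts)

# Hypothetical index with 5 shards.
print(docs_after_limit([80] * 5, limit=100))   # no shard reaches the limit
print(docs_after_limit([300] * 5, limit=100))  # 100 per shard, not 100 overall
```

In the second case 500 documents are aggregated, not 100, so on a multi-shard index the value may need to be scaled down accordingly.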
Before I begin, please note that this is not a perfect solution to the question. However, it could definitely ease the situation, and in a special case it actually is a perfect solution.
The solution I propose works by sorting the terms aggregation buckets by the score of the documents they were found in. That is, the terms are no longer ordered only by frequency but also by document score.
Here is an example request:
{
"query": {
"query_string": {
"default_field": "product_title",
"query": "iphone 6"
}
},
"aggs": {
"product_categories": {
"terms": {
"field": "product_category",
"order": [
{ "max_score": "desc" },
{ "_count": "desc" }
],
"size": 3
},
"aggs": {
"max_score": {
"max": {
"script": "_score"
}
}
}
}
}
}
Please note the "order" property of the terms aggregation. It specifies a path to the max_score sub-aggregation, which in turn simply returns the special _score value, exposing the score of each document hit by the query. The ordering also uses the frequency of each term, via "_count", as the second criterion.
This request gives you the three terms in the product_category field that best combine "very frequent" and "from highly ranked documents". I cannot say more precisely how the ranking is done. In preliminary experiments I noticed that the result does not follow document scores monotonically but may "jump over" a quite highly ranked document when it only contains low-frequency terms, which might actually be what you want for your use case. The documentation for this kind of ordering is found here: http://www.elastic.co/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-terms-aggregation.html
The linked documentation also has an example of ordering by multiple criteria, but it only says "The above will sort the countries buckets based on the average height among the female population and then by their doc_count in descending order". My impression was that it could be some kind of harmonic mean or similar. It is probably best to check for yourself whether the results of this approach are useful to you.
The special case I spoke of at the beginning is when each document has exactly one value in the requested field. In this case, when you leave out the "_count" ordering, you actually get the top N terms for the top N documents (N being the same for both).
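One way to explore the ordering offline is a small Python sketch that builds one bucket per category and sorts the buckets by maximum score, then by document count. It assumes the two criteria are applied lexicographically (the count only breaks ties on the maximum score), which is one reading of the documentation sentence quoted above and not something verified against Elasticsearch here. The hits and the helper name `order_buckets` are hypothetical.

```python
from collections import defaultdict

def order_buckets(hits, size, use_count=True):
    """Return the top `size` categories ordered by max score, then doc count."""
    buckets = defaultdict(lambda: {"max_score": float("-inf"), "doc_count": 0})
    for h in hits:
        b = buckets[h["product_category"]]
        b["max_score"] = max(b["max_score"], h["score"])
        b["doc_count"] += 1

    def sort_key(item):
        _, b = item
        # Negate values so that an ascending sort yields descending order.
        if use_count:
            return (-b["max_score"], -b["doc_count"])
        return (-b["max_score"],)

    return [name for name, _ in sorted(buckets.items(), key=sort_key)[:size]]

# Hypothetical hits, one category per document (the special case).
hits = [
    {"score": 9.0, "product_category": "phones"},
    {"score": 7.5, "product_category": "cases"},
    {"score": 7.5, "product_category": "chargers"},
    {"score": 7.5, "product_category": "chargers"},
    {"score": 2.0, "product_category": "cables"},
]

print(order_buckets(hits, size=3))                   # max_score, then _count
print(order_buckets(hits, size=3, use_count=False))  # max_score only
```

Under this assumption, "chargers" moves ahead of "cases" only because both share the same maximum score and "chargers" has more documents; without the count criterion, the categories simply follow their best-scoring document.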