Given Users with an index on age:

{ name: 'Bob', age: 21 }
{ name: 'Cathy', age: 21 }
{ name: 'Joe', age: 33 }
To answer your first question: $group
does not preserve the order. There are open change requests which also shed some light on the background, but it doesn't look like the product will be changed to preserve the input documents' order.
Two things can be said in general: You generally want to group first and then do the sorting, because sorting fewer elements (which the grouping generally produces) is faster than sorting all input documents.
Secondly, MongoDB is going to sort as efficiently and as little as possible. The documentation states:
When a $sort immediately precedes a $limit in the pipeline, the $sort operation only maintains the top n results as it progresses, where n is the specified limit, and MongoDB only needs to store n items in memory. This optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.
So this code gets the job done in your case:
collection.aggregate([
  {
    $group: {
      _id: '$age',
      names: { $push: '$name' }
    }
  },
  {
    $sort: {
      _id: 1
    }
  },
  {
    $limit: 10
  }
])
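Run against the three sample documents above, this should produce something like the following (note that the order of the names within each group is not guaranteed, for exactly the reasons discussed above):

{ _id: 21, names: [ 'Bob', 'Cathy' ] }
{ _id: 33, names: [ 'Joe' ] }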
EDIT following your comments:
I agree with what you say. And taking your logic a little further, I would go as far as saying: if $group
were smart enough to use an index, then it shouldn't even require a $sort
stage at the start. Unfortunately, it isn't (not yet, probably). As things stand today, $group
will never use an index, and it won't take shortcuts based on the following stages ($limit
in this case). Also see this link where someone ran some basic tests.
The aggregation framework is still pretty young, so I guess there is a lot of work being done to make the aggregation pipeline smarter and faster.
There are answers here on StackOverflow (e.g. here) where people suggest using an upfront $sort
stage in order to "force" MongoDB to use an index somehow. In my tests (1 million records of your sample shape using different random distributions), however, that approach slowed things down significantly.
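For reference, the "upfront sort" pattern suggested in those answers looks roughly like this (a sketch, assuming the { age: 1 } index from your setup):

collection.aggregate([
  // upfront sort on the indexed field, meant to nudge MongoDB into
  // walking the { age: 1 } index before grouping
  { $sort: { age: 1 } },
  { $group: { _id: '$age', names: { $push: '$name' } } },
  { $sort: { _id: 1 } },
  { $limit: 10 }
])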
When it comes to the performance of an aggregation pipeline, $match
stages at the start are what helps the most. If you can limit the total number of records that need to go through the pipeline from the beginning, then that's your best bet - obviously... ;)
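For example (a sketch using a hypothetical filter; a $match that is the very first stage of the pipeline can use the { age: 1 } index):

collection.aggregate([
  // hypothetical filter: only consider users younger than 30;
  // as the first stage, this $match can use the { age: 1 } index
  { $match: { age: { $lt: 30 } } },
  { $group: { _id: '$age', names: { $push: '$name' } } },
  { $sort: { _id: 1 } },
  { $limit: 10 }
])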