Is it possible to sort, group and limit efficiently in Mongo with a pipeline?

走了就别回头了 2021-01-13 12:35

Given Users with an index on age:

{ name: 'Bob',
  age:   21  }

{ name: 'Cathy',
  age:   21  }

{ name: 'Joe',
  age:   33  }

To get the names grouped by age (one document per age, sorted by age):

1 Answer
  • 2021-01-13 13:00
    To answer your first question: $group does not preserve the order. There are open change requests which also shed some light on the background, but it doesn't look like the product will be changed to preserve the input documents' order:

    • https://jira.mongodb.org/browse/SERVER-24799
    • https://jira.mongodb.org/browse/SERVER-4507
    • https://jira.mongodb.org/browse/SERVER-21022

    Two things can be said in general: you generally want to group first and sort afterwards. The reason is that sorting fewer elements (which the grouping generally produces) is going to be faster than sorting all input documents.

    Secondly, MongoDB makes sure to sort as efficiently, and as little, as possible. The documentation states:

    When a $sort immediately precedes a $limit in the pipeline, the $sort operation only maintains the top n results as it progresses, where n is the specified limit, and MongoDB only needs to store n items in memory. This optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.
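    The effect of that coalescence can be sketched in plain JavaScript (a simulation of the idea, not MongoDB internals): instead of sorting everything, keep a buffer that never holds more than n items.

```javascript
// Simulated $sort + $limit coalescence: while streaming through the
// input, maintain only the current top n (here: the n smallest ages).
function topN(docs, n, key) {
    const buf = [];                           // never grows beyond n entries
    for (const doc of docs) {
        buf.push(doc);
        buf.sort((a, b) => a[key] - b[key]);  // sorts at most n + 1 items
        if (buf.length > n) buf.pop();        // drop the current worst
    }
    return buf;
}

const users = [
    { name: 'Bob',   age: 21 },
    { name: 'Cathy', age: 21 },
    { name: 'Joe',   age: 33 },
];
console.log(topN(users, 2, 'age').map(d => d.name)); // [ 'Bob', 'Cathy' ]
```

    Memory stays bounded by n regardless of the input size, which is exactly why the quoted optimization matters.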

    So this code gets the job done in your case:

    collection.aggregate([
        // Group first: one output document per distinct age
        { $group: {
            _id: '$age',
            names: { $push: '$name' }
        } },
        // Then sort the (much smaller) set of groups by age
        { $sort: { _id: 1 } },
        // $sort followed by $limit: only the top 10 are kept in memory
        { $limit: 10 }
    ])
    

    EDIT following your comments:

    I agree with what you say. Taking your logic a little further, I would go as far as saying: if $group were smart enough to use an index, it shouldn't even require a $sort stage at the start. Unfortunately, it's not (not yet, probably). As things stand today, $group will never use an index, and it won't take shortcuts based on the following stages ($limit in this case). Also see this link where someone ran some basic tests.

    The aggregation framework is still pretty young, so I guess there is a lot of work being done to make the aggregation pipeline smarter and faster.

    There are answers here on StackOverflow (e.g. here) where people suggest using an upfront $sort stage in order to "force" MongoDB to use an index somehow. In my tests (1 million records of your sample shape, using different random distributions), however, this slowed things down significantly.
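    For reference, the suggested variant has this shape (the extra upfront $sort on the indexed field is the only difference; this is the version that tested slower, not a recommendation):

```javascript
// Variant with an upfront $sort on the indexed age field, as suggested
// in some answers -- in the tests mentioned above this was slower.
const pipeline = [
    { $sort: { age: 1 } },   // hoped to make the stage read from the index
    { $group: { _id: '$age', names: { $push: '$name' } } },
    { $sort: { _id: 1 } },
    { $limit: 10 },
];
console.log(pipeline.map(stage => Object.keys(stage)[0]));
// [ '$sort', '$group', '$sort', '$limit' ]
```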

    When it comes to the performance of an aggregation pipeline, $match stages at the start help the most. If you can limit the total number of records that need to go through the pipeline from the beginning, then that's your best bet - obviously... ;)
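    A minimal sketch of that advice, assuming for illustration that only users under 30 are of interest (the age bound is invented, not part of the question):

```javascript
// $match first, so fewer documents ever reach $group and $sort.
// A $match on age at the start can also use the index on age.
const pipeline = [
    { $match: { age: { $lt: 30 } } },  // hypothetical filter for illustration
    { $group: { _id: '$age', names: { $push: '$name' } } },
    { $sort: { _id: 1 } },
    { $limit: 10 },
];

// With the sample documents, only Bob and Cathy would enter the pipeline:
const users = [
    { name: 'Bob',   age: 21 },
    { name: 'Cathy', age: 21 },
    { name: 'Joe',   age: 33 },
];
const matched = users.filter(u => u.age < 30); // what $match would pass on
console.log(matched.map(u => u.name)); // [ 'Bob', 'Cathy' ]
```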
