Random record from MongoDB

栀梦 2020-11-22 01:22

I am looking to get a random record from a huge (100 million record) MongoDB database.

What is the fastest and most efficient way to do so? The data is already t

27 answers
  • 2020-11-22 01:38

    I would suggest using map/reduce, where the map function emits a document only when a random value is at or below a given probability.

    function mapf() {
        if (Math.random() <= probability) {
            emit(1, this);
        }
    }
    
    function reducef(key,values) {
        return {"documents": values};
    }
    
    res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": { "probability": 0.5}});
    printjson(res.results);
    

    The reducef function above works because only one key ('1') is emitted from the map function.

    The value of "probability" is defined in the "scope" option when invoking mapReduce(...).

    Using mapReduce like this should also be usable on a sharded db.

    If you want to select exactly n of m documents from the db, you could do it like this:

    function mapf() {
        if(countSubset == 0) return;
        var prob = countSubset / countTotal;
        if(Math.random() <= prob) {
            emit(1, {"documents": [this]}); 
            countSubset--;
        }
        countTotal--;
    }
    
    function reducef(key, values) {
        var newArray = new Array();
        for (var i = 0; i < values.length; i++) {
            newArray = newArray.concat(values[i].documents);
        }

        return {"documents": newArray};
    }
    
    res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}})
    printjson(res.results);
    

    Where "countTotal" (m) is the number of documents in the db, and "countSubset" (n) is the number of documents to retrieve.

    This approach might give some problems on sharded databases.
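
    The counter logic in the second mapf can be tried outside MongoDB. Below is a minimal sketch, assuming an in-memory array stands in for the collection; the function name sampleExactly and the optional random parameter are made up for illustration and are not part of any MongoDB API.

```javascript
// Selection sampling over an in-memory array: the same counter logic as
// the second mapf above, with `docs` standing in for the collection.
function sampleExactly(docs, n, random = Math.random) {
    let countTotal = docs.length;               // m: documents not yet visited
    let countSubset = Math.min(n, countTotal);  // n: documents still to pick
    const picked = [];
    for (const doc of docs) {
        if (countSubset === 0) break;
        // Pick with probability (still needed) / (still remaining).
        // When the two counters are equal the probability is 1, so the
        // loop can never run out of documents before n are picked.
        if (random() < countSubset / countTotal) {
            picked.push(doc);
            countSubset--;
        }
        countTotal--;
    }
    return picked;
}

const exampleDocs = Array.from({ length: 10 }, (_, i) => ({ _id: i }));
console.log(sampleExactly(exampleDocs, 3).length); // 3
```

    The sketch relies on every document being visited once by a single pair of counters, which is the assumption the map/reduce version makes as well.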

  • 2020-11-22 01:38

    My PHP/MongoDB sort/order by RANDOM solution. Hope this helps someone.

    Note: I have numeric IDs within my MongoDB collection that refer to a MySQL database record.

    First I create an array with 10 randomly generated numbers:

        $randomNumbers = [];
        for($i = 0; $i < 10; $i++){
            $randomNumbers[] = rand(0,1000);
        }
    

    In my aggregation I use the $addFields pipeline stage combined with $arrayElemAt and $mod (modulus). The modulus operator gives me a number from 0 to 9, which I then use as an index into the array of randomly generated numbers.

        $aggregate[] = [
            '$addFields' => [
                'random_sort' => [ '$arrayElemAt' => [ $randomNumbers, [ '$mod' => [ '$my_numeric_mysql_id', 10 ] ] ] ],
            ],
        ];
    

    After that you can use the $sort pipeline stage.

        $aggregate[] = [
            '$sort' => [
                'random_sort' => 1
            ]
        ];
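
    For comparison, here is a rough JavaScript sketch of the same two stages, plus a client-side version of the random_sort expression. The function names and the idField parameter are hypothetical, and nothing here talks to a server. It also shows a property of the approach: documents whose ids share a remainder get the same sort key, so the shuffle is only random between those groups.

```javascript
// Build the $addFields and $sort stages as plain JavaScript objects.
// `idField` is a hypothetical name for the numeric id field.
function randomSortPipeline(randomNumbers, idField) {
    const buckets = randomNumbers.length;
    return [
        {
            $addFields: {
                random_sort: {
                    $arrayElemAt: [randomNumbers, { $mod: ["$" + idField, buckets] }],
                },
            },
        },
        { $sort: { random_sort: 1 } },
    ];
}

// Client-side equivalent of the random_sort expression for one document
// (assumes a non-negative integer id).
function randomSortKey(doc, randomNumbers, idField) {
    return randomNumbers[doc[idField] % randomNumbers.length];
}

const exampleNumbers = [7, 2, 9];
// Ids 4 and 7 both leave remainder 1 when divided by 3, so their keys match.
console.log(
    randomSortKey({ id: 4 }, exampleNumbers, "id") ===
        randomSortKey({ id: 7 }, exampleNumbers, "id")
); // true
```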
    
  • 2020-11-22 01:39

    Starting with the 3.2 release of MongoDB, you can get N random docs from a collection using the $sample aggregation pipeline stage:

    // Get one random document from the mycoll collection.
    db.mycoll.aggregate([{ $sample: { size: 1 } }])
    

    If you want to select the random document(s) from a filtered subset of the collection, prepend a $match stage to the pipeline:

    // Get one random document matching {a: 10} from the mycoll collection.
    db.mycoll.aggregate([
        { $match: { a: 10 } },
        { $sample: { size: 1 } }
    ])
    

    As noted in the comments, when size is greater than 1, there may be duplicates in the returned document sample.
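
    If duplicates matter, one option is to drop them on the client after the query returns. This is only a sketch: uniqueById is a made-up helper name, and it assumes each _id converts to a distinct string with String(). Note that the de-duplicated result can contain fewer documents than the requested size.

```javascript
// Keep the first occurrence of each _id from an array of sampled documents.
function uniqueById(docs) {
    const seen = new Set();
    return docs.filter((doc) => {
        const key = String(doc._id); // assumes _id stringifies to a unique value
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });
}

const sampled = [{ _id: "a" }, { _id: "b" }, { _id: "a" }];
console.log(uniqueById(sampled).length); // 2
```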

  • 2020-11-22 01:40

    I'd suggest adding a random int field to each object. Then you can just do a

    findOne({random_field: {$gte: rand()}}) 
    

    to pick a random document. Just make sure you index the field with ensureIndex({random_field: 1}) (named createIndex in current MongoDB versions).
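
    One edge case: if the random value is larger than every stored random_field, the $gte query finds nothing, so a fallback in the other direction is needed. The sketch below simulates that on an in-memory array. It assumes random_field holds numbers in the same range as the value being drawn, and that the lookup walks the index in ascending order; pickByRandomField is a hypothetical name.

```javascript
// In-memory stand-in for findOne({random_field: {$gte: r}}) over an
// ascending index, with a fallback for when r exceeds every stored value.
function pickByRandomField(docs, r) {
    const sorted = [...docs].sort((a, b) => a.random_field - b.random_field);
    const atOrAbove = sorted.find((doc) => doc.random_field >= r);
    if (atOrAbove) return atOrAbove;
    // Nothing at or above r: fall back to the largest value below it.
    return sorted.length > 0 ? sorted[sorted.length - 1] : null;
}

const stored = [
    { _id: "a", random_field: 0.2 },
    { _id: "b", random_field: 0.7 },
];
console.log(pickByRandomField(stored, 0.5)._id); // "b"
console.log(pickByRandomField(stored, 0.9)._id); // "b" (fallback)
```

    Documents that follow a large gap in random_field values are picked more often than others, so the result is only approximately uniform.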

  • 2020-11-22 01:41

    My solution in PHP:

    /**
     * Get random docs from Mongo
     * @param $collection
     * @param $where
     * @param $fields
     * @param $limit
     * @author happy-code
     * @url happy-code.com
     */
    private function _mongodb_get_random (MongoCollection $collection, $where = array(), $fields = array(), $limit = false) {
    
        // Total docs
        $count = $collection->find($where, $fields)->count();
    
        if (!$limit) {
            // Get all docs
            $limit = $count;
        }
    
        $data = array();
        for( $i = 0; $i < $limit; $i++ ) {
    
            // Skip documents
            $skip = rand(0, ($count-1) );
            if ($skip !== 0) {
                $doc = $collection->find($where, $fields)->skip($skip)->limit(1)->getNext();
            } else {
                $doc = $collection->find($where, $fields)->limit(1)->getNext();
            }
    
            if (is_array($doc)) {
                // Catch document
                $data[ $doc['_id']->{'$id'} ] = $doc;
                // Ignore current document when making the next iteration
                $where['_id']['$nin'][] = $doc['_id'];
            }
    
            // Each iteration picks one document, so decrease the number of remaining documents
            $count--;
    
        }
    
        return $data;
    }
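
    The same count-then-skip idea, reduced to its core in JavaScript. This is a sketch over an in-memory array rather than a real cursor: splice() plays the role of skip($skip)->limit(1) together with the $nin exclusion, and pickRandomBySkip is a made-up name.

```javascript
// Pick `limit` distinct documents by repeatedly choosing a random offset
// into the documents that have not been picked yet.
function pickRandomBySkip(docs, limit, random = Math.random) {
    const remaining = [...docs];
    const picked = [];
    const wanted = Math.min(limit, remaining.length);
    while (picked.length < wanted) {
        // Offset in 0 .. count-1, like rand(0, $count - 1) in the PHP version.
        const skip = Math.floor(random() * remaining.length);
        // Remove the chosen document so it cannot be picked again.
        picked.push(remaining.splice(skip, 1)[0]);
    }
    return picked;
}

const candidates = Array.from({ length: 10 }, (_, i) => ({ _id: i }));
console.log(pickRandomBySkip(candidates, 4).length); // 4
```

    Against a real collection, each skip still has to walk past the skipped documents on the server, so large offsets on a large collection can be slow.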
    
  • 2020-11-22 01:41

    Using Map/Reduce, you can certainly get a random record, though not necessarily very efficiently, depending on the size of the filtered collection you end up working with.

    I've tested this method with 50,000 documents (the filter reduces it to about 30,000), and it executes in approximately 400 ms on an Intel i3 with 16 GB RAM and a SATA3 HDD...

    db.toc_content.mapReduce(
        /* map function */
        function() { emit( 1, this._id ); },
    
        /* reduce function */
        function(k,v) {
            var r = Math.floor((Math.random()*v.length));
            return v[r];
        },
    
        /* options */
        {
            out: { inline: 1 },
            /* Filter the collection to "A"ctive documents */
            query: { status: "A" }
        }
    );
    

    The Map function simply emits the _id of every document that matches the query, which collects them into an array. In my case I tested this with approximately 30,000 out of the 50,000 possible documents.

    The Reduce function simply picks a random integer between 0 and the number of items in the array minus 1, and then returns the _id at that index.

    400 ms sounds like a long time, and it really is. If you had fifty million records instead of fifty thousand, the overhead could grow to the point where this becomes unusable in multi-user situations.

    There is an open issue for MongoDB to include this feature in the core... https://jira.mongodb.org/browse/SERVER-533

    If this "random" selection were built into an index lookup instead of collecting ids into an array and then selecting one, it would help incredibly. (go vote it up!)
