Random record from MongoDB

栀梦 2020-11-22 01:22

I am looking to get a random record from a huge MongoDB collection (100 million records).

What is the fastest and most efficient way to do so? The data is already t

27 answers
  • 2020-11-22 01:42

    This is tough if there is nothing in the data to key off. What are the _id fields? Are they MongoDB ObjectIds? If so, you could get the lowest and highest values:

    lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
    highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;
    

    Then, if you assume the ids are uniformly distributed (they aren't, but it's at least a start):

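    // Pseudocode: an ObjectId is 12 bytes; interpolate on the first 8
    unsigned long long L = first_8_bytes_of(lowest);
    unsigned long long H = first_8_bytes_of(highest);
    
    V = (H - L) * random_from_0_to_1();
    N = L + V;
    oid = N concat random_4_bytes();
    
    // Sort by _id so the match closest to oid is returned
    randomobj = db.coll.find({_id:{$gte:oid}}).sort({_id:1}).limit(1);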
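
    The pseudocode above can be sketched in plain JavaScript. This is a minimal sketch, not a definitive implementation: randomOidHexBetween is a hypothetical helper name, and it uses BigInt because the first 8 bytes of an ObjectId do not fit exactly in a JavaScript number.

```javascript
// Hypothetical helper: build a 24-character ObjectId hex string whose first
// 8 bytes fall between those of lowHex and highHex.
function randomOidHexBetween(lowHex, highHex, rand = Math.random) {
  const L = BigInt("0x" + lowHex.slice(0, 16));  // first 8 bytes of lowest _id
  const H = BigInt("0x" + highHex.slice(0, 16)); // first 8 bytes of highest _id
  // Turn the random fraction into an integer numerator over 2^53 so the
  // interpolation can be done in exact BigInt arithmetic.
  const SCALE = 2n ** 53n;
  const frac = BigInt(Math.floor(rand() * Number(SCALE)));
  const N = L + ((H - L) * frac) / SCALE;
  // The last 4 bytes are random, as in the pseudocode.
  const tail = Math.floor(rand() * 0x100000000);
  return N.toString(16).padStart(16, "0") + tail.toString(16).padStart(8, "0");
}
```

    In the shell, the returned string would be wrapped as ObjectId(hex) and used in the $gte query, with lowest.str and highest.str (or the equivalent hex strings) as inputs.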
    
  • 2020-11-22 01:43

    Here is a way using the default ObjectId values for _id and a little math and logic.

    // Get the "min" and "max" timestamp values from the _id in the collection and the 
    // diff between.
    // 4-bytes from a hex string is 8 characters
    
    var min = parseInt(db.collection.find()
            .sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
        max = parseInt(db.collection.find()
            .sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
        diff = max - min;
    
    // Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
    var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;
    
    // Use "random" in the range and pad the hex string to a valid ObjectId
    var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
    
    // Then query for the single document:
    var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
       .sort({ "_id": 1 }).limit(1).toArray()[0];
    

    That's the general logic in mongo shell syntax, and it is easy to adapt.

    So in points:

    • Find the min and max primary key values in the collection

    • Generate a random number that falls between the timestamps of those documents.

    • Add the random number to the minimum value and find the first document that is greater than or equal to that value.

    This uses "padding" from the timestamp value in "hex" to form a valid ObjectId value, since that is what we are looking for. Using integers as the _id value is essentially simpler, but the basic idea in the points above is the same.
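
    The timestamp and padding steps can also be tried without a database. The helpers below are a hedged sketch; oidSeconds, oidHexFromSeconds and randomBoundaryOidHex are hypothetical names, and the inputs are assumed to be 24-character ObjectId hex strings.

```javascript
// Seconds since the Unix epoch, stored in the first 4 bytes (8 hex characters).
function oidSeconds(oidHex) {
  return parseInt(oidHex.slice(0, 8), 16);
}

// Smallest possible ObjectId hex string for a given second: the timestamp
// padded to 8 hex characters, followed by 16 zero characters.
function oidHexFromSeconds(seconds) {
  return seconds.toString(16).padStart(8, "0") + "0".repeat(16);
}

// Random boundary between the oldest and newest _id timestamps (inclusive).
function randomBoundaryOidHex(minOidHex, maxOidHex, rand = Math.random) {
  const min = oidSeconds(minOidHex);
  const max = oidSeconds(maxOidHex);
  const seconds = min + Math.floor(rand() * (max - min + 1));
  return oidHexFromSeconds(seconds);
}
```

    The result would then be passed to new ObjectId(...) and used in the $gte query shown earlier.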

  • 2020-11-22 01:44

    In Python using pymongo:

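    import random
    
    def get_random_doc():
        # count() was removed in PyMongo 4; estimated_document_count() reads metadata
        count = collection.estimated_document_count()
        # skip() walks past the skipped documents, so this is slow on large collections
        return collection.find().skip(random.randrange(count)).limit(1).next()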
    
  • 2020-11-22 01:47

    You can pick a random timestamp and search for the first object that was created afterwards. It will only scan a single document, though it doesn't necessarily give you a uniform distribution.

    var randRec = function() {
        // replace with your collection
        var coll = db.collection
        // get unixtime of first and last record
        var min = coll.find().sort({_id: 1}).limit(1).toArray()[0]._id.getTimestamp() - 0;
        var max = coll.find().sort({_id: -1}).limit(1).toArray()[0]._id.getTimestamp() - 0;
    
        // allow to pass additional query params
        return function(query) {
            if (typeof query === 'undefined') query = {}
            var randTime = Math.round(Math.random() * (max - min)) + min;
            var hexSeconds = Math.floor(randTime / 1000).toString(16);
            var id = ObjectId(hexSeconds + "0000000000000000");
            query._id = {$gte: id}
            return coll.find(query).sort({_id: 1}).limit(1)
        };
    }();
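
    To see why the distribution is not uniform, the selection rule can be simulated on plain numbers. This is an illustrative sketch only; simulatePicks is a hypothetical helper and the creation times are made up.

```javascript
// Simulate "first document created at or after a random time".
// creationTimes must be sorted ascending; returns pick counts per index.
function simulatePicks(creationTimes, trials, rand = Math.random) {
  const min = creationTimes[0];
  const max = creationTimes[creationTimes.length - 1];
  const counts = new Array(creationTimes.length).fill(0);
  for (let i = 0; i < trials; i++) {
    const t = min + rand() * (max - min);
    // Index of the first document created at or after t.
    const picked = creationTimes.findIndex((c) => c >= t);
    counts[picked] += 1;
  }
  return counts;
}

// Three documents: two created close together, one after a long gap.
const pickCounts = simulatePicks([0, 1, 100], 10000);
console.log(pickCounts);
```

    A document that follows a long gap in creation times covers most of the time range, so it is returned far more often than documents created in a burst.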
    
  • 2020-11-22 01:48

    Using Python (pymongo), the aggregate function also works.

    collection.aggregate([{'$sample': {'size': sample_size }}])
    

    This approach is a lot faster than fetching the document at a random offset (e.g. collection.find()[random_int]), especially for large collections.

  • 2020-11-22 01:48

    The following aggregation operation randomly selects 3 documents from the collection:

    db.users.aggregate( [ { $sample: { size: 3 } } ] )

    https://docs.mongodb.com/manual/reference/operator/aggregation/sample/
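
    If the random documents also have to satisfy a filter, one option is to put a $match stage in front of $sample. The sketch below only builds the pipeline; buildSamplePipeline is a hypothetical helper name, and the status field in the usage note is made up for illustration.

```javascript
// Hypothetical helper: build an aggregation pipeline that samples `size`
// random documents, optionally restricted by a $match filter first.
function buildSamplePipeline(size, filter = {}) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const pipeline = [];
  if (Object.keys(filter).length > 0) {
    pipeline.push({ $match: filter });
  }
  pipeline.push({ $sample: { size: size } });
  return pipeline;
}
```

    In the shell this could be used as db.users.aggregate(buildSamplePipeline(3, { status: "active" })).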
