I am looking to get a random record from a huge (100 million record) mongodb
.
What is the fastest and most efficient way to do so? The data is already t
it is tough if there is no data there to key off of. what are the _id field? are they mongodb object id's? If so, you could get the highest and lowest values:
lowest = db.coll.find().sort({_id:1}).limit(1).next()._id;
highest = db.coll.find().sort({_id:-1}).limit(1).next()._id;
then if you assume the id's are uniformly distributed (but they aren't, but at least it's a start):
unsigned long long L = first_8_bytes_of(lowest)
unsigned long long H = first_8_bytes_of(highest)
V = (H - L) * random_from_0_to_1();
N = L + V;
oid = N concat random_4_bytes();
randomobj = db.coll.find({_id:{$gte:oid}}).limit(1);
Here is a way using the default ObjectId values for _id
and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters
var min = parseInt(db.collection.find()
.sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
max = parseInt(db.collection.find()
.sort({ "_id": -1 })limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
diff = max - min;
// Get a random value from diff and divide/multiply be 1000 for The "_id" precision:
var random = Math.floor(Math.floor(Math.random(diff)*diff)/1000)*1000;
// Use "random" in the range and pad the hex string to a valid ObjectId
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
.sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" from the timestamp value in "hex" to form a valid ObjectId
value since that is what we are looking for. Using integers as the _id
value is essentially simplier but the same basic idea in the points.
In Python using pymongo:
import random
def get_random_doc():
count = collection.count()
return collection.find()[random.randrange(count)]
You can pick a random timestamp and search for the first object that was created afterwards. It will only scan a single document, though it doesn't necessarily give you a uniform distribution.
var randRec = function() {
// replace with your collection
var coll = db.collection
// get unixtime of first and last record
var min = coll.find().sort({_id: 1}).limit(1)[0]._id.getTimestamp() - 0;
var max = coll.find().sort({_id: -1}).limit(1)[0]._id.getTimestamp() - 0;
// allow to pass additional query params
return function(query) {
if (typeof query === 'undefined') query = {}
var randTime = Math.round(Math.random() * (max - min)) + min;
var hexSeconds = Math.floor(randTime / 1000).toString(16);
var id = ObjectId(hexSeconds + "0000000000000000");
query._id = {$gte: id}
return coll.find(query).limit(1)
};
}();
Using Python (pymongo), the aggregate function also works.
collection.aggregate([{'$sample': {'size': sample_size }}])
This approach is a lot faster than running a query for a random number (e.g. collection.find([random_int]). This is especially the case for large collections.
The following aggregation operation randomly selects 3 documents from the collection:
db.users.aggregate( [ { $sample: { size: 3 } } ] )
https://docs.mongodb.com/manual/reference/operator/aggregation/sample/