Random record from MongoDB


栀梦 2020-11-22 01:22

I am looking to get a random record from a huge MongoDB collection (100 million records).

What is the fastest and most efficient way to do so? The data is already t

27 answers

伪装坚强ぢ (original poster)

2020-11-22 01:38

I would suggest using map/reduce, where the map function only emits a document when a random value is at or below a given probability.

function mapf() {
    if (Math.random() <= probability) {
        emit(1, this);
    }
}

function reducef(key,values) {
    return {"documents": values};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": { "probability": 0.5}});
printjson(res.results);

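The reducef function above works because the map function emits only one key (1), so every sampled document is collected under that key. On large inputs MongoDB can call reduce more than once for the same key, in which case the emitted value and the reduce result need the same shape, as in the second example further down.

The value of "probability" is defined in "scope" when invoking mapReduce(...).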

Using mapReduce like this should also be usable on a sharded db.
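To make the effect of "probability" concrete, here is a minimal sketch in plain JavaScript that mimics the map-side filter on an in-memory array, with no MongoDB involved. The names bernoulliSample, probabilityFor, docs and rng are made up for this illustration and are not part of any MongoDB API. It assumes you want roughly a given number of documents and derive the probability from that.

```javascript
// Plain JavaScript stand-in for the map-side filter; no MongoDB required.
// bernoulliSample, probabilityFor, docs and rng are hypothetical names.
function bernoulliSample(docs, probability, rng = Math.random) {
    // Keep each document independently, mirroring `Math.random() <= probability`.
    return docs.filter(() => rng() <= probability);
}

// Probability that yields about `wanted` documents out of `total` on average.
function probabilityFor(wanted, total) {
    return Math.min(1, wanted / total);
}

const docs = Array.from({ length: 10000 }, (_, i) => ({ _id: i }));
const p = probabilityFor(100, docs.length); // 0.01
const sample = bernoulliSample(docs, p);
console.log(p, sample.length); // the sample size varies from run to run
```

With a probability of 0.5 on 100 million records, about half of the collection would be emitted, so a much smaller value is probably what you want for an inline result.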

If you want to select exactly n of m documents from the db, you could do it like this:

function mapf() {
    if(countSubset == 0) return;
    var prob = countSubset / countTotal;
    if(Math.random() <= prob) {
        emit(1, {"documents": [this]}); 
        countSubset--;
    }
    countTotal--;
}

function reducef(key, values) {
    var newArray = [];
    for (var i = 0; i < values.length; i++) {
        newArray = newArray.concat(values[i].documents);
    }
    return {"documents": newArray};
}

res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}})
printjson(res.results);

Here "countTotal" (m) is the number of documents in the collection, and "countSubset" (n) is the number of documents to retrieve.

This approach may run into problems on sharded databases.
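The n-of-m selection can also be tried outside MongoDB. The following is only a sketch in plain JavaScript, assuming the documents are visited one by one with a single shared pair of counters; selectExactly and rng are hypothetical names introduced for this illustration.

```javascript
// Plain JavaScript stand-in for the n-of-m map function; no MongoDB required.
// selectExactly and rng are hypothetical names for this illustration.
function selectExactly(docs, countSubset, rng = Math.random) {
    let countTotal = docs.length; // documents not yet visited (m)
    const picked = [];
    for (const doc of docs) {
        if (countSubset === 0) break;          // quota already filled
        const prob = countSubset / countTotal; // reaches 1 once every remaining doc is needed
        if (rng() <= prob) {
            picked.push(doc);
            countSubset--;
        }
        countTotal--;
    }
    return picked;
}

const picked = selectExactly([{ _id: 1 }, { _id: 2 }, { _id: 3 }, { _id: 4 }], 2);
console.log(picked.length); // 2
```

The count comes out exact because the probability rises to 1 as soon as the remaining documents are all needed. That only holds while one set of counters sees every document, which is presumably why a sharded setup can break it.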
