Random record from MongoDB

栀梦 2020-11-22 01:22

I am looking to get a random record from a huge (100 million record) MongoDB database.

What is the fastest and most efficient way to do so? The data is already t

27 answers
  •  伪装坚强ぢ
    2020-11-22 01:38

    I would suggest using map/reduce, where the map function emits a document only when a random draw is at or below a given probability.

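    // Emit each document with the given probability; every value shares key 1.
    // Values are wrapped as {"documents": [...]} so that map and reduce output
    // have the same shape, which MongoDB requires when reduce is re-run.
    function mapf() {
        if (Math.random() <= probability) {
            emit(1, {"documents": [this]});
        }
    }

    // Merge the emitted document arrays into a single array.
    function reducef(key, values) {
        var merged = [];
        for (var i = 0; i < values.length; i++) {
            merged = merged.concat(values[i].documents);
        }
        return {"documents": merged};
    }

    // "probability" is passed to mapf through the scope option.
    res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"probability": 0.5}});
    printjson(res.results);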

    Because the map function emits every value under a single key (1), all sampled documents end up in one reduce result.

    The value of "probability" is defined in the "scope" option when invoking mapReduce(...).

    This use of mapReduce should also work on a sharded database.
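
    As a rough illustration of what the map step above does, the following plain-JavaScript sketch applies the same Math.random() <= probability test to an in-memory array instead of a collection. The names bernoulliSample, rng, and docs are hypothetical and not part of any MongoDB API; rng exists only so the draw can be fixed for the two deterministic calls.

```javascript
// Plain-JavaScript simulation of the map step; no MongoDB connection is used.
// `bernoulliSample` and `rng` are hypothetical names chosen for illustration.
function bernoulliSample(docs, probability, rng = Math.random) {
    const kept = [];
    for (const doc of docs) {
        // Same test as mapf(): keep the document when the draw is <= probability.
        if (rng() <= probability) {
            kept.push(doc);
        }
    }
    return kept;
}

// Ten stand-in documents with sequential _id values.
const docs = Array.from({ length: 10 }, (_, i) => ({ _id: i }));

const none = bernoulliSample(docs, 0, () => 0.5); // fixed draw 0.5 > 0: nothing kept
const all = bernoulliSample(docs, 1, () => 0.5);  // fixed draw 0.5 <= 1: everything kept
const some = bernoulliSample(docs, 0.5);          // size varies from run to run

console.log(none.length, all.length, some.length);
```

    The number of documents returned is itself random: on average it is the probability times the collection size. For a very large collection a much smaller probability than 0.5 is presumably needed, since inline results have to fit in a single BSON document.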

    If you want to select exactly n of m documents from the db, you could do it like this:

    // Selection sampling: countSubset documents are still needed out of the
    // countTotal documents not yet visited. Both counters come from scope.
    function mapf() {
        if (countSubset == 0) return;
        var prob = countSubset / countTotal;
        if (Math.random() <= prob) {
            emit(1, {"documents": [this]});
            countSubset--;
        }
        countTotal--;
    }

    // Merge the emitted document arrays into a single array.
    function reducef(key, values) {
        var newArray = [];
        for (var i = 0; i < values.length; i++) {
            newArray = newArray.concat(values[i].documents);
        }
        return {"documents": newArray};
    }

    res = db.questions.mapReduce(mapf, reducef, {"out": {"inline": 1}, "scope": {"countTotal": 4, "countSubset": 2}});
    printjson(res.results);

    Here "countTotal" (m) is the number of documents in the collection, and "countSubset" (n) is the number of documents to retrieve.
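
    To see why the two counters yield exactly n documents, here is a minimal plain-JavaScript sketch of the same loop run over an in-memory array. The names selectExactly, rng, and docs are hypothetical. The key point is that once the number of documents still needed equals the number of documents still unvisited, the ratio reaches 1 and every remaining document is taken.

```javascript
// Plain-JavaScript simulation of the counting map step; no MongoDB is involved.
// `selectExactly` and `rng` are hypothetical names chosen for illustration.
function selectExactly(docs, countSubset, rng = Math.random) {
    let countTotal = docs.length; // m: documents not yet visited
    const picked = [];
    for (const doc of docs) {
        if (countSubset === 0) break; // already have n documents
        const prob = countSubset / countTotal;
        if (rng() <= prob) {
            picked.push(doc);
            countSubset--;
        }
        countTotal--;
    }
    return picked;
}

const docs = Array.from({ length: 10 }, (_, i) => ({ _id: i }));

// With random draws the members differ per run, but the size is always 3.
const sample = selectExactly(docs, 3);

// Extreme fixed draws show both ends of the behaviour.
const firstThree = selectExactly(docs, 3, () => 0);       // picks _id 0, 1, 2
const lastThree = selectExactly(docs, 3, () => 0.999999); // picks _id 7, 8, 9

console.log(sample.length, firstThree.map(d => d._id), lastThree.map(d => d._id));
```

    The guarantee depends on countTotal matching the number of documents that actually pass through the map function, so a stale count would skew or shorten the sample.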

    This approach may run into problems on sharded databases.
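
    One plausible reason for the sharding caveat (an assumption; the answer does not spell it out) is that each shard runs the map function against its own copy of the scope variables, so the counters are not shared. The plain-JavaScript sketch below simulates two shards of five documents each; runMapOnShard, rng, and the shard arrays are hypothetical names.

```javascript
// Plain-JavaScript simulation; `runMapOnShard` and `rng` are hypothetical names.
function runMapOnShard(shardDocs, scope, rng) {
    // Assumption: every shard starts from a private copy of the scope counters.
    let { countTotal, countSubset } = scope;
    const emitted = [];
    for (const doc of shardDocs) {
        if (countSubset === 0) break;
        if (rng() <= countSubset / countTotal) {
            emitted.push(doc);
            countSubset--;
        }
        countTotal--;
    }
    return emitted;
}

// Ask for 3 of 10 documents, but split the 10 documents over two shards.
const scope = { countTotal: 10, countSubset: 3 };
const shardA = [0, 1, 2, 3, 4].map(i => ({ _id: i }));
const shardB = [5, 6, 7, 8, 9].map(i => ({ _id: i }));

const lowDraws = () => 0;     // every draw succeeds
const highDraws = () => 0.99; // every draw fails while the ratio stays below 0.99

const tooMany = [...runMapOnShard(shardA, scope, lowDraws), ...runMapOnShard(shardB, scope, lowDraws)];
const tooFew = [...runMapOnShard(shardA, scope, highDraws), ...runMapOnShard(shardB, scope, highDraws)];

console.log(tooMany.length, tooFew.length); // 6 and 0, neither is the requested 3
```

    Under that assumption the combined result can contain anywhere from 0 to 6 documents here, so the exact-count guarantee no longer holds.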
