Removing duplicate records using MapReduce

借酒劲吻你 2020-12-10 06:09

I'm using MongoDB and need to remove duplicate records. I have a listing collection that looks like this (simplified):

[
  { "MlsId": "12345" },
  { "M


        
4 Answers
  •  有刺的猬
    2020-12-10 06:46

    I have not used MongoDB, but I have used MapReduce. I think you are on the right track with the map and reduce functions. To exclude the 0 and empty-string values, you can add a check in the map function itself, something like:

    m = function () {
      // Skip records whose MlsId is 0 or an empty string.
      if (this.MlsId != 0 && this.MlsId != "") {
        emit(this.MlsId, 1);
      }
    };
    

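    The reduce function should produce one value per key. In MongoDB it returns that value rather than calling emit, so it should be:

    r = function (k, vals) {
      // Sum the 1s emitted for this MlsId to get its occurrence count.
      return Array.sum(vals);
    };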
    

    After this, you should have a set of key-value pairs in the output, where the key is the MlsId and the value is the number of times that ID occurs. I am not sure about the db.drop() part. As you pointed out, it will most probably delete every record for an MlsId instead of removing only the duplicates. To get around this, maybe you can call drop() first and then recreate the record for each MlsId once. Will that work for you?
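
To make the counting step concrete, here is a minimal sketch in plain JavaScript that mimics what the map and reduce functions above would produce, without needing MongoDB. The sample documents, their `_id` values, and the helper `runMapReduce` are made up for illustration, and passing `emit` as an argument is a simplification (MongoDB provides it as a global). Keeping the first record per MlsId and removing the rest is only one possible policy, assumed here for the example.

```javascript
// Hypothetical sample data; the real collection will look different.
const listings = [
  { _id: 1, MlsId: "12345" },
  { _id: 2, MlsId: "12345" },
  { _id: 3, MlsId: "23456" },
  { _id: 4, MlsId: "" },
  { _id: 5, MlsId: 0 },
  { _id: 6, MlsId: "12345" },
];

// Hypothetical helper: groups emitted values by key, then reduces each group.
// Unlike MongoDB, it calls reduce even for keys with a single value.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) map.call(doc, emit);
  const out = {};
  for (const [key, vals] of groups) out[key] = reduce(key, vals);
  return out;
}

// Same logic as the map function above, with emit passed in explicitly.
const map = function (emit) {
  if (this.MlsId != 0 && this.MlsId != "") {
    emit(this.MlsId, 1);
  }
};

// Returns the occurrence count for one MlsId.
const reduce = (key, vals) => vals.reduce((a, b) => a + b, 0);

const counts = runMapReduce(listings, map, reduce);
console.log(counts); // { '12345': 3, '23456': 1 }

// IDs that occur more than once are the duplicates.
const duplicateIds = Object.keys(counts).filter((id) => counts[id] > 1);

// Assumed policy: keep the first record per duplicated MlsId, remove the rest.
const seen = new Set();
const idsToRemove = [];
for (const doc of listings) {
  if (!(counts[doc.MlsId] > 1)) continue; // not a duplicated ID
  if (seen.has(doc.MlsId)) idsToRemove.push(doc._id);
  else seen.add(doc.MlsId);
}
console.log(idsToRemove); // [ 2, 6 ]
```

With a list like `idsToRemove` in hand, only the surplus records would need deleting, so the collection would not have to be dropped and rebuilt.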
