Removing duplicate records using MapReduce

Asked by 借酒劲吻你 on 2020-12-10 06:09 · backend · unresolved · 4 answers · 789 views

I'm using MongoDB and need to remove duplicate records. I have a listing collection that looks like so (simplified):

    [
      { "MlsId": "12345" },
      { "M…
4 Answers
  • 2020-12-10 06:38

    In mongodb you can use a query to restrict documents that are passed in for mapping. You probably want to do that for the ones you don't care about. Then in the reduce function you can ignore the dups and only return one of the docs for each duplicate key.

    I'm a little confused about your goal though. If you just want to find duplicates and remove all but one of them then you can just create a unique index on that field and use the dropDups option; the process of creating the index will drop duplicate docs. Keeping the index will ensure that it doesn't happen again.

    http://www.mongodb.org/display/DOCS/Indexes#Indexes-DuplicateValues
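For example, a sketch assuming the collection is named `listing` (the dropDups option only existed up to MongoDB 2.6 and was removed in 3.0):

```javascript
// Sketch only: on MongoDB <= 2.6, building this index drops all but one
// document per duplicate MlsId and then enforces uniqueness going forward.
db.listing.ensureIndex({ MlsId: 1 }, { unique: true, dropDups: true });
```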

  • 2020-12-10 06:43

    You can use an aggregation operation to find the duplicates. Unwind the array, introduce a dummy $group with a $sum stage, and ignore the counts in your next stage. Something like this:

    db.myCollection.aggregate([
      { $unwind: '$list' },
      {
        $group: {
          _id: { listing_id: '$_id', MlsId: '$list.MlsId' },
          count: { $sum: 1 }
        }
      },
      {
        $group: {
          _id: '$_id.listing_id',
          list: { $addToSet: { MlsId: '$_id.MlsId' } }
        }
      }
    ]);
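Note that the pipeline above only regroups the values; to actually locate duplicates you still need a $match on the counts. A hedged sketch, assuming the duplicated field is a top-level MlsId rather than an embedded list:

```javascript
// Sketch: group documents by MlsId, collect their _ids, and keep only the
// groups that occur more than once (i.e. the duplicates).
db.myCollection.aggregate([
  { $group: { _id: '$MlsId', count: { $sum: 1 }, ids: { $push: '$_id' } } },
  { $match: { count: { $gt: 1 } } }
]);
```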
    
  • 2020-12-10 06:46

    I have not used MongoDB but I have used mapreduce. I think you are on the right track in terms of the mapreduce functions. To exclude the 0 and empty strings, you can add a check in the map function itself, something like:

    m = function () { 
      if(this.MlsId!=0 && this.MlsId!="") {    
        emit(this.MlsId, 1); 
      }
    } 
    

    And reduce should return a value, not emit one. So it should be:

    r = function (k, vals) {
      return Array.sum(vals);
    }
    

    After this, you should have a set of key-value pairs in the output such that the key is the MlsId and the value is the number of times that particular ID occurs. I am not sure about the db.drop() part. As you pointed out, it will most probably delete all MlsIds instead of removing only the duplicate ones. To get around this, maybe you can call drop() first and then recreate the MlsId once. Will that work for you?
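    As a sanity check outside the shell, the counting logic of these map/reduce functions can be sketched in plain JavaScript over a made-up array of documents (the sample data here is purely illustrative, not from the question):

```javascript
// Plain-JS sketch of the map/reduce counting logic above (not the MongoDB API);
// the sample documents are invented for illustration.
const docs = [
  { MlsId: "12345" }, { MlsId: "12345" }, { MlsId: "67890" },
  { MlsId: 0 }, { MlsId: "" }
];

const counts = {};
for (const doc of docs) {
  if (doc.MlsId !== 0 && doc.MlsId !== "") {          // map: skip 0 / empty ids
    counts[doc.MlsId] = (counts[doc.MlsId] || 0) + 1; // reduce: sum the 1s
  }
}
// counts is now { "12345": 2, "67890": 1 }
```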

  • 2020-12-10 06:49

    This is how I followed @harri's answer to remove the duplicates:

    // contains duplicated document ids and the number of duplicates
    db.createCollection("myDupesCollection")
    res = db.sampledDB.mapReduce(m, r, { out: "myDupesCollection" });

    // iterate through duplicated docs and remove duplicates (keep one)
    db.myDupesCollection.find({ value: { $gt: 1 } }).forEach(function (myDoc) {
      var u_id = myDoc._id;   // the map function emits MlsId itself as the key
      var count = myDoc.value;
      // remove() takes a justOne flag, not a count, so delete one duplicate
      // at a time; if there are 3 docs, remove 3-1=2 of them
      for (var i = 0; i < count - 1; i++) {
        db.sampledDB.remove({ MlsId: u_id }, { justOne: true });
      }
    });
    