Remove Duplicates on MongoDB

我寻月下人不归 2021-01-23 08:30

I would like to remove duplicates in Robomongo. My version is 3.0.12, so I can't use dropDups:

{
    "_id" : ObjectId("id"),
    "Name" : "No One",
    "Sit


        
1 Answer
  • 2021-01-23 08:43

    If you are prepared to simply discard all other duplicates, then you basically want to .aggregate() in order to collect the documents with the same RegisterNumber value, and then remove all documents other than the first match.

    MongoDB 3.0.x lacks some of the modern helpers, but the basics are still there: .aggregate() returns a cursor for processing large result sets, and "bulk operations" are available for write performance:

    var bulk = db.collection.initializeOrderedBulkOp();
    var count = 0;
    
    db.collection.aggregate([
      // Group on unique value storing _id values to array and count 
      { "$group": {
        "_id": "$RegisterNumber",
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }      
      }},
      // Only return things that matched more than once. i.e a duplicate
      { "$match": { "count": { "$gt": 1 } } }
    ]).forEach(function(doc) {
      var keep = doc.ids.shift();     // takes the first _id from the array
    
      bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
      count++;
    
      if ( count % 500 == 0 ) {  // only actually write per 500 operations
          bulk.execute();
          bulk = db.collection.initializeOrderedBulkOp();  // re-init after execute
      }
    });
    
    // Clear any queued operations
    if ( count % 500 != 0 )
        bulk.execute();
    

    In more modern releases (3.2 and above) it is preferred to use bulkWrite() instead. Note that this is a "client library" thing, as the same "bulk" methods shown above are actually called "under the hood":

    var ops = [];
    
    db.collection.aggregate([
      { "$group": {
        "_id": "$RegisterNumber",
        "ids": { "$push": "$id" },
        "count": { "$sum": 1 }      
      }},
      { "$match": { "count": { "$gt": 1 } } }
    ]).forEach( doc => {
    
      var keep = doc.ids.shift();
    
      ops = [
        ...ops,
        {
          "deleteMany": { "filter": { "_id": { "$in": doc.ids } } }
        }
      ];
    
      if (ops.length >= 500) {
        db.collection.bulkWrite(ops);
        ops = [];
      }
    });
    
    if (ops.length > 0)
      db.collection.bulkWrite(ops);
    

    So $group pulls everything together by the RegisterNumber value and collects the matching document _id values into an array, keeping a count of how many times this happens with $sum.

    Then $match filters out any groupings that only had a count of 1, since those are clearly not duplicates.
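
    For illustration, a single result from that pipeline might look something like this (the RegisterNumber and ObjectId values below are made up for the example):

    // Hypothetical $group + $match output for one duplicated RegisterNumber
    {
        "_id" : "REG-1234",                        // the duplicated RegisterNumber value
        "ids" : [
            ObjectId("5f1a2b3c4d5e6f7a8b9c0d01"),
            ObjectId("5f1a2b3c4d5e6f7a8b9c0d02"),
            ObjectId("5f1a2b3c4d5e6f7a8b9c0d03")
        ],
        "count" : 3                                // appeared three times, so it passed $match
    }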

    Passing to the loop, you remove the first occurrence of _id in the collected list for the key with .shift(), leaving only the other "duplicates" in the array.

    These are passed to the "remove" operation with $in as a "list" of documents to match and remove.
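
    To make that concrete, here is a minimal sketch (reusing the made-up _id values from the example above) of what the loop body does for one grouped document:

    // Minimal sketch with made-up _id values -- not real documents
    var ids = [
        ObjectId("5f1a2b3c4d5e6f7a8b9c0d01"),
        ObjectId("5f1a2b3c4d5e6f7a8b9c0d02"),
        ObjectId("5f1a2b3c4d5e6f7a8b9c0d03")
    ];
    var keep = ids.shift();                    // keep -> ...0d01; ids now holds ...0d02 and ...0d03

    // The filter handed to bulk.find(...).remove() / "deleteMany"
    var filter = { "_id": { "$in": ids } };    // matches only the two leftover duplicates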

    The process is generally the same if you need something more complex, such as merging details from the other duplicate documents. You just need more care if you are doing something like converting the case of the "unique key", and in that situation you actually remove the duplicates first before writing changes to the document to be kept.
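
    As a sketch of that "more care" case: if RegisterNumber values only differ by letter case, you could group on a lower-cased form of the key instead. This assumes RegisterNumber is a string; the rest of the pipeline is unchanged:

    // Sketch only: case-insensitive grouping, so "abc123" and "ABC123" fall into one group
    db.collection.aggregate([
      { "$group": {
        "_id": { "$toLower": "$RegisterNumber" },
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
      }},
      { "$match": { "count": { "$gt": 1 } } }
    ])

    The removal loop itself stays the same; the extra care is in deciding which document to keep and whether its RegisterNumber needs rewriting once the duplicates are gone.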

    At any rate, the aggregation will highlight the documents that actually are "duplicates". The remaining processing logic is based on whatever you actually want to do with that information once you identify them.
