Remove Duplicates on mongodb

我寻月下人不归 2021-01-23 08:30

I would like to remove duplicates using Robomongo. My version is 3.0.12, so I can't use dropDups:

{
    "_id" : ObjectId("id"),
    "Name" : "No One",
    "Sit


        
1 Answer
  • 2021-01-23 08:43

    If you are prepared to simply discard all other duplicates, then you basically want to use .aggregate() to collect the documents with the same RegisterNumber value and remove every document other than the first match.

    MongoDB 3.0.x lacks some of the modern helpers, but the basics are there: .aggregate() returns a cursor for processing large result sets, and "bulk operations" are available for write performance:

    var bulk = db.collection.initializeOrderedBulkOp();
    var count = 0;
    
    db.collection.aggregate([
      // Group on unique value storing _id values to array and count 
      { "$group": {
        "_id": "$RegisterNumber",
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }      
      }},
      // Only return things that matched more than once. i.e a duplicate
      { "$match": { "count": { "$gt": 1 } } }
    ]).forEach(function(doc) {
      var keep = doc.ids.shift();     // takes the first _id from the array
    
      bulk.find({ "_id": { "$in": doc.ids }}).remove(); // remove all remaining _id matches
      count++;
    
      if ( count % 500 == 0 ) {  // only actually write per 500 operations
          bulk.execute();
          bulk = db.collection.initializeOrderedBulkOp();  // re-init after execute
      }
    });
    
    // Clear any queued operations
    if ( count % 500 != 0 )
        bulk.execute();
    

    In more modern releases (3.2 and above), bulkWrite() is preferred instead. Note that this is a "client library" matter, since the same "bulk" methods shown above are actually called "under the hood":

    var ops = [];
    
    db.collection.aggregate([
      { "$group": {
        "_id": "$RegisterNumber",
        "ids": { "$push": "$_id" },   // "$_id", not "$id": collect each document's _id
        "count": { "$sum": 1 }
      }},
      { "$match": { "count": { "$gt": 1 } } }
    ]).forEach( doc => {
    
      var keep = doc.ids.shift();
    
      ops = [
        ...ops,
        {
          "deleteMany": { "filter": { "_id": { "$in": doc.ids } } }
        }
      ];
    
      if (ops.length >= 500) {
        db.collection.bulkWrite(ops);
        ops = [];
      }
    });
    
    if (ops.length > 0)
      db.collection.bulkWrite(ops);
    

    So $group pulls everything together via the $RegisterNumber value and collects the matching document _id values to an array. You keep the count of how many times this happens using $sum.

    Then filter out any documents that only had a count of 1 since those are clearly not duplicates.

    Inside the loop, .shift() removes the first occurrence of _id from the collected list for the key, leaving only the other "duplicates" in the array.

    These are passed to the "remove" operation with $in as a "list" of documents to match and remove.
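
    To make the steps above concrete, here is a minimal sketch in plain JavaScript that mirrors the same logic on an in-memory array, with no database involved. The sample documents and the helper names groupIds and idsToRemove are made up for illustration. Note that on a real collection, which _id counts as "first" depends on the order in which documents reach $group.

    ```javascript
    // Sample documents; the values are invented for illustration only.
    const docs = [
      { _id: 1, RegisterNumber: "A1" },
      { _id: 2, RegisterNumber: "B2" },
      { _id: 3, RegisterNumber: "A1" },
      { _id: 4, RegisterNumber: "A1" },
      { _id: 5, RegisterNumber: "B2" },
      { _id: 6, RegisterNumber: "C3" }
    ];

    // Counterpart of $group: collect the _id values per RegisterNumber.
    function groupIds(documents) {
      const groups = new Map();
      for (const doc of documents) {
        if (!groups.has(doc.RegisterNumber)) groups.set(doc.RegisterNumber, []);
        groups.get(doc.RegisterNumber).push(doc._id);
      }
      return groups;
    }

    // Counterpart of $match (count > 1) followed by .shift():
    // skip keys seen only once, keep the first _id, return the rest.
    function idsToRemove(documents) {
      const remove = [];
      for (const ids of groupIds(documents).values()) {
        if (ids.length > 1) {
          remove.push(...ids.slice(1)); // everything after the first _id
        }
      }
      return remove;
    }

    const removed = idsToRemove(docs);
    console.log(removed); // the _id values that would go into $in
    ```

    With this sample data, documents 1, 2 and 6 survive, and the remaining _id values are what the "remove" operation would receive.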

    The process is generally the same if you need something more complex, such as merging details from the other duplicate documents. You just need more care if you are doing something like converting the case of the "unique key", in which case you should remove the duplicates first and only then write changes to the document being kept.
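
    As a rough sketch of that ordering, the hypothetical helper below (planMerge is not part of any MongoDB API) takes the full documents of one duplicate group and builds two bulkWrite() operations: the delete for the duplicates first, then the update for the kept document. It assumes the pipeline collected whole documents rather than only _id values, for example by pushing "$$ROOT" and grouping on a case-normalized key; the "Phone" field and the choice of lower case are invented for illustration. Since bulkWrite() runs operations in order by default, the delete would be applied before the update.

    ```javascript
    // Hypothetical helper: build the bulkWrite() operations for one group of
    // duplicates. "group" holds the full documents sharing a key; the first is kept.
    function planMerge(group) {
      const [keep, ...rest] = group;
      const merged = Object.assign({}, keep);

      // Copy over any field that the kept document is missing.
      for (const dup of rest) {
        for (const [field, value] of Object.entries(dup)) {
          if (merged[field] === undefined) merged[field] = value;
        }
      }
      // Normalize the case of the key (lower case is an arbitrary choice here).
      merged.RegisterNumber = String(keep.RegisterNumber).toLowerCase();

      const { _id, ...fields } = merged; // _id itself is never modified
      return [
        // Duplicates go first, so the normalized key cannot collide with them.
        { deleteMany: { filter: { _id: { $in: rest.map(d => d._id) } } } },
        { updateOne: { filter: { _id: keep._id }, update: { $set: fields } } }
      ];
    }

    const plan = planMerge([
      { _id: 1, RegisterNumber: "AB1", Name: "No One" },
      { _id: 2, RegisterNumber: "ab1", Phone: "555" } // "Phone" is a made-up field
    ]);
    console.log(JSON.stringify(plan, null, 2));
    ```

    The returned array could be appended to the ops list from the bulkWrite() example above in place of the single deleteMany entry.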

    At any rate, the aggregation will highlight the documents that actually are "duplicates". The remaining processing logic is based on whatever you actually want to do with that information once you identify them.
