How to remove duplicates based on a key in MongoDB?

伪装坚强ぢ 2020-11-30 20:56

I have a collection in MongoDB with around 3 million records. A sample record looks like:

    { "_id" = ObjectId("50731xxxxxxxxxxxxxxxxxx

8 Answers
  • 2020-11-30 21:06

    Here is a slightly more 'manual' way of doing it:

    Essentially, first get a list of all the unique values of the key you are interested in.

    Then perform a search for each of those values, and if the search returns more than one document, remove all but one of them.

        // For each distinct value of "key", skip the first document found
        // and remove one matching document per additional occurrence.
        db.collection.distinct("key").forEach((num) => {
          var i = 0;
          db.collection.find({ key: num }).forEach((doc) => {
            if (i) db.collection.remove({ key: num }, { justOne: true });
            i++;
          });
        });
    
  • 2020-11-30 21:06

    Expanding on Fernando's answer, I found that it was taking too long, so I modified it.

    var x = 0; // total number of documents processed, for progress reporting
    db.collection.distinct("field").forEach(fieldValue => {
      var i = 0;
      db.collection.find({ "field": fieldValue }).forEach(doc => {
        if (i) {
          // Remove every document after the first by its _id,
          // which is served by the default _id index.
          db.collection.remove({ _id: doc._id });
        }
        i++;
        x += 1;
        if (x % 100 === 0) {
          print(x); // Print progress every 100 documents.
        }
      });
    });
    

    The improvement is using the document _id for the removal, which should be faster, and adding progress reporting; you can change the reporting interval (100) to whatever amount you prefer.

    Also, indexing the field before the operation helps.
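
    For example, a single-field index could be created like this (assuming the duplicate key is a field named "field", as above):

        db.collection.createIndex({ field: 1 })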

  • 2020-11-30 21:15

    While @Stennie's is a valid answer, it is not the only way. In fact the MongoDB manual asks you to be very cautious while doing that. There are two other options:

    1. Let MongoDB do it for you using map-reduce (a minimal sketch follows below).
    2. Do it programmatically, which is less efficient.
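
    A sketch of the map-reduce option, assuming the duplicate key is a field named "key" and using a scratch output collection named "dup_report" (both names are placeholders):

        // Group every _id under its key value.
        var mapper = function () {
          emit(this.key, { ids: [this._id] });
        };
        var reducer = function (key, values) {
          var merged = { ids: [] };
          values.forEach(function (v) {
            merged.ids = merged.ids.concat(v.ids);
          });
          return merged;
        };
        db.collection.mapReduce(mapper, reducer, { out: "dup_report" });

        // Keep the first _id for each key and remove the rest.
        db.dup_report.find().forEach(function (doc) {
          var ids = doc.value.ids;
          if (ids.length > 1) {
            db.collection.remove({ _id: { $in: ids.slice(1) } });
          }
        });
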
  • 2020-11-30 21:16

    This is the easiest query I used, on MongoDB 3.2:

    // Walk the collection in _id order; for each document, delete every
    // later document (greater _id) that has the same myCustomKey.
    db.myCollection.find({}, { myCustomKey: 1 }).sort({ _id: 1 }).forEach(function (doc) {
        db.myCollection.remove({ _id: { $gt: doc._id }, myCustomKey: doc.myCustomKey });
    })
    

    Index your custom key before running this to increase speed.
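
    For example (assuming the key is named "myCustomKey" as above):

        db.myCollection.createIndex({ myCustomKey: 1 })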

  • 2020-11-30 21:20

    I had a similar requirement, but I wanted to retain the latest entry. The following query worked with my collections containing millions of records and duplicates.

    /** Create an array to store the _ids of all duplicate records */
    var duplicates = [];
    
    /** Start Aggregation pipeline*/
    db.collection.aggregate([
      {
        $match: { /** Add any filter here; index the filter keys */
          filterKey: {
            $exists: false
          }
        }
      },
      {
        $sort: { /** Sort so that the document you want to retain comes first */
          createdAt: -1
        }
      },
      {
        $group: {
          _id: {
            key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
          },
          dups: {
            $push: {
              _id: "$_id"
            }
          },
          count: {
            $sum: 1
          }
        }
      },
      {
        $match: {
          count: {
            "$gt": 1
          }
        }
      }
    ],
    {
      allowDiskUse: true
    }).forEach(function(doc){
      doc.dups.shift();
      doc.dups.forEach(function(dupId){
        duplicates.push(dupId._id);
      })
    })
    
    /** Delete the duplicates in chunks */
    var i,j,temparray,chunk = 100000;
    for (i=0,j=duplicates.length; i<j; i+=chunk) {
        temparray = duplicates.slice(i,i+chunk);
        db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
    }
    
  • 2020-11-30 21:21

    pip install mongo_remove_duplicate_indexes

    1. Create a script in any language.
    2. Iterate over your collection.
    3. Create a new collection and create a unique index on it for the field you want to deduplicate. The index must be on the same field you wish to remove duplicates from in your original collection. For example, if you have a collection gaming with a field genre that contains duplicates, create a new collection (db.createCollection("cname")) and a unique index on it (db.cname.createIndex({ genre: 1 }, { unique: true })). Now when you insert documents with the same genre, only the first will be accepted; the others will be rejected with a duplicate key error.
    4. Insert the documents you read into the new collection and handle the exception (for example pymongo.errors.DuplicateKeyError in Python), as sketched below.
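
    A minimal mongo shell sketch of steps 3 and 4, reusing the example names gaming and genre, with a hypothetical target collection gaming_dedup:

        // Build a copy of the collection that rejects duplicate genres.
        db.createCollection("gaming_dedup");
        db.gaming_dedup.createIndex({ genre: 1 }, { unique: true });

        db.gaming.find().forEach(function (doc) {
          try {
            db.gaming_dedup.insertOne(doc); // first document per genre wins
          } catch (e) {
            // Duplicate key error: skip this duplicate.
            print("skipped duplicate _id: " + doc._id);
          }
        });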

    Check out the source code of the mongo_remove_duplicate_indexes package for a better understanding.
