How to remove duplicates based on a key in MongoDB?

Asked by 伪装坚强ぢ on 2020-11-30 20:56

I have a collection in MongoDB with around 3 million records. A sample record looks like:

    { "_id" : ObjectId("50731xxxxxxxxxxxxxxxxxx
8 answers
  • 2020-11-30 21:24

    If you have enough memory, you can do something like this in Scala:

    cole.find().toList
      .groupBy(_.customField)                 // group documents by the duplicate key
      .filter { case (_, docs) => docs.size > 1 }
      .flatMap { case (_, docs) => docs.tail } // keep the first document of each group
      .map(_.id)
      .foreach(id => cole.remove(MongoDBObject("_id" -> id)))
    
  • 2020-11-30 21:31

    This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach is required in most cases. For example, you could use aggregation, as suggested in: MongoDB duplicate documents even after adding unique key.
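    Since dropDups is gone, the aggregation route boils down to grouping documents by the duplicate key, keeping one document per group, and deleting the rest. Here is a minimal sketch of that keep-first logic in plain Python, run on an in-memory list of documents rather than a live collection (the field name `key` is hypothetical; adapt it to your schema). With pymongo you would compute the same grouping server-side with a `$group` pipeline and then pass the collected ids to `delete_many({"_id": {"$in": stale}})`.

    ```python
    def ids_to_remove(docs, key):
        """Return the _ids of all but the first document for each key value."""
        seen = set()
        stale = []
        for doc in docs:
            k = doc.get(key)
            if k in seen:
                stale.append(doc["_id"])  # duplicate: schedule for removal
            else:
                seen.add(k)               # first occurrence: keep it
        return stale

    docs = [
        {"_id": 1, "key": "a"},
        {"_id": 2, "key": "b"},
        {"_id": 3, "key": "a"},
    ]
    print(ids_to_remove(docs, "key"))  # -> [3]
    ```

    Note that "first" here means first in iteration order; on a real collection you would sort (e.g. by _id) before deduplicating if it matters which copy survives.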

    If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:

    db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})
    

    This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.

    Important note: Any documents missing the source_references.key field will be treated as having a null value, so any subsequent documents missing that field will also be deleted. You can add the sparse: true index creation option so that the index only applies to documents that have a source_references.key field.

    Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
