Find duplicate records in MongoDB

伪装坚强ぢ 2020-11-28 02:14

How would I find duplicate field values in a MongoDB collection?

I'd like to check if any of the "name" fields are duplicates.

{
    "name" : "ksqn291"


        
4 Answers
  • 2020-11-28 02:46
    db.getCollection('orders').aggregate([
        {$group: {
            _id: {name: "$name"},
            uniqueIds: {$addToSet: "$_id"},
            count: {$sum: 1}
        }},
        {$match: {
            count: {"$gt": 1}
        }}
    ])
    

    First, the $group stage groups the documents by the name field.

    For each group we collect the unique _id values and count the documents. If the count is greater than 1, that name is duplicated in the collection, so the $match stage keeps only those groups.
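
    To make the two stages concrete, here is a minimal plain-JavaScript sketch that imitates them on an in-memory array. It is only an illustration of the logic, not how MongoDB runs the pipeline; the function name findDuplicates and the sample documents are made up for this example.

```javascript
// Plain-JavaScript imitation of $group followed by $match, for illustration only.
// `findDuplicates` and `docs` are hypothetical names, not MongoDB APIs.
function findDuplicates(docs, field) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[field];
    // Like $group: one bucket per distinct value of the field.
    if (!groups.has(key)) {
      groups.set(key, { _id: { [field]: key }, uniqueIds: [], count: 0 });
    }
    const group = groups.get(key);
    // Like $addToSet on _id (every _id is already unique, so push is enough here).
    group.uniqueIds.push(doc._id);
    // Like {$sum: 1}: count the documents in the bucket.
    group.count += 1;
  }
  // Like $match: keep only the buckets seen more than once.
  return [...groups.values()].filter((g) => g.count > 1);
}

// Assumed sample data.
const docs = [
  { _id: 1, name: "ksqn291" },
  { _id: 2, name: "ksqn291" },
  { _id: 3, name: "other" },
];
const duplicates = findDuplicates(docs, "name");
console.log(JSON.stringify(duplicates));
```

    The uniqueIds array is what lets you go back and inspect or clean up the duplicated documents afterwards.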

  • 2020-11-28 02:51

    The answer anhic gave can be very inefficient if you have a large database and the name attribute is present in only some of the documents.

    To improve efficiency, you can add a $match stage at the start of the aggregation so that documents without a name are filtered out before grouping.

    db.collection.aggregate([
        {"$match": {"name" :{ "$ne" : null } } },
        {"$group" : {"_id": "$name", "count": { "$sum": 1 } } },
        {"$match": {"count" : {"$gt": 1} } },
        {"$project": {"name" : "$_id", "_id" : 0} }
    ])
    
  • 2020-11-28 02:52

    Use aggregation on name and get name with count > 1:

    db.collection.aggregate([
        {"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
        {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, 
        {"$project": {"name" : "$_id", "_id" : 0} }
    ]);
    

    To sort the results by most to least duplicates:

    db.collection.aggregate([
        {"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
        {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, 
        {"$sort": {"count" : -1} },
        {"$project": {"name" : "$_id", "_id" : 0} }     
    ]);
    

    To use a field other than "name", change "$name" to "$column_name".
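
    If you need this for several fields, one option is to build the pipeline from the field name. The sketch below is only a suggestion: duplicatesPipeline is a hypothetical helper (not part of MongoDB), and "email" is an assumed field name used for illustration.

```javascript
// Hypothetical helper (not a MongoDB API): builds the same pipeline for any field.
function duplicatesPipeline(fieldName) {
  return [
    // Group by the chosen field and count the documents per value.
    { $group: { _id: "$" + fieldName, count: { $sum: 1 } } },
    // Drop the null/missing group and keep values seen more than once.
    { $match: { _id: { $ne: null }, count: { $gt: 1 } } },
    // Most duplicated values first.
    { $sort: { count: -1 } },
    // Return the value under its original field name.
    { $project: { [fieldName]: "$_id", _id: 0 } },
  ];
}

// "email" is an assumed field name for this example.
const pipeline = duplicatesPipeline("email");
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell: db.collection.aggregate(duplicatesPipeline("email"))
```

    Building the pipeline as a plain array keeps it easy to print and check before running it against a real collection.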

  • 2020-11-28 03:07

    You can find the list of duplicate names using the following aggregate pipeline:

    • Group all the records that have the same name.
    • Match the groups that contain more than one record.
    • Then group again to collect all the duplicate names into a single array.

    The Code:

    db.collection.aggregate([
    {$group:{"_id":"$name","name":{$first:"$name"},"count":{$sum:1}}},
    {$match:{"count":{$gt:1}}},
    {$project:{"name":1,"_id":0}},
    {$group:{"_id":null,"duplicateNames":{$push:"$name"}}},
    {$project:{"_id":0,"duplicateNames":1}}
    ])
    

    Output:

    { "duplicateNames" : [ "ksqn291", "ksqn29123213Test" ] }
    