How would I find duplicate fields in a mongo collection.
I\'d like to check if any of the \"name\" fields are duplicates.
{
\"name\" : \"ksqn291\
db.getCollection('orders').aggregate([
{$group: {
_id: {name: "$name"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
}
])
First Group Query the group according to the fields.
Then we check the unique Id and count it, If count is greater then 1 then the field is duplicate in the entire collection so that thing is to be handle by $match query.
The answer anhic gave can be very inefficient if you have a large database and the attribute name is present only in some of the documents.
To improve efficiency you can add a $match to the aggregation.
db.collection.aggregate(
{"$match": {"name" :{ "$ne" : null } } },
{"$group" : {"_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"count" : {"$gt": 1} } },
{"$project": {"name" : "$_id", "_id" : 0} }
)
Use aggregation on name
and get name
with count > 1
:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$project": {"name" : "$_id", "_id" : 0} }
]);
To sort the results by most to least duplicates:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$sort": {"count" : -1} },
{"$project": {"name" : "$_id", "_id" : 0} }
]);
To use with another column name than "name", change "$name" to "$column_name"
You can find the list
of duplicate
names using the following aggregate
pipeline:
Group
all the records having similar name
.Match
those groups
having records greater than 1
.group
again to project
all the duplicate names as an array
.The Code:
db.collection.aggregate([
{$group:{"_id":"$name","name":{$first:"$name"},"count":{$sum:1}}},
{$match:{"count":{$gt:1}}},
{$project:{"name":1,"_id":0}},
{$group:{"_id":null,"duplicateNames":{$push:"$name"}}},
{$project:{"_id":0,"duplicateNames":1}}
])
o/p:
{ "duplicateNames" : [ "ksqn291", "ksqn29123213Test" ] }