I would like to remove duplicates in Robomongo; my version is 3.0.12, so I can't use dropDups.
{
    "_id" : ObjectId("id"),
    "Name" : "No One",
    "Sit
If you are prepared to simply discard all other duplicates, then you basically want to .aggregate() in order to collect the documents with the same RegisterNumber value and remove all documents other than the first match.
MongoDB 3.0.x lacks some of the modern helpers, but the basics are still there: .aggregate() returns a cursor so you can process large result sets, and "bulk operations" exist for write performance:
var bulk = db.collection.initializeOrderedBulkOp();
var count = 0;

db.collection.aggregate([
    // Group on the unique value, storing _id values to an array and counting
    { "$group": {
        "_id": "$RegisterNumber",
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    // Only return things that matched more than once, i.e. a duplicate
    { "$match": { "count": { "$gt": 1 } } }
]).forEach(function(doc) {
    var keep = doc.ids.shift();     // takes the first _id from the array

    bulk.find({ "_id": { "$in": doc.ids }}).remove();   // remove all remaining _id matches
    count++;

    if ( count % 500 == 0 ) {       // only actually write per 500 operations
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp(); // re-init after execute
    }
});

// Clear any queued operations
if ( count % 500 != 0 )
    bulk.execute();
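Once the duplicates are gone, you may optionally add a unique index so new duplicates are rejected at write time, which is essentially what dropDups used to be paired with. This assumes RegisterNumber really should be unique across the collection:

db.collection.createIndex({ "RegisterNumber": 1 }, { "unique": true })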
In more modern releases (3.2 and above) it is preferred to use .bulkWrite() instead. Note that this is a "client library" thing: the same "bulk" methods shown above are actually called "under the hood":
var ops = [];

db.collection.aggregate([
    { "$group": {
        "_id": "$RegisterNumber",
        "ids": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
]).forEach( doc => {
    var keep = doc.ids.shift();

    ops = [
        ...ops,
        { "deleteMany": { "filter": { "_id": { "$in": doc.ids } } } }
    ];

    if (ops.length >= 500) {
        db.collection.bulkWrite(ops);
        ops = [];
    }
});

if (ops.length > 0)
    db.collection.bulkWrite(ops);
So $group pulls everything together via the RegisterNumber value and collects the matching document _id values into an array, while $sum keeps a count of how many times each value occurs. Then $match filters out any grouping that only had a count of 1, since those are clearly not duplicates.
Passing to the loop, you remove the first occurrence of _id from the collected list for each key with .shift(), leaving only the other "duplicates" in the array. These are passed to the "remove" operation with $in as a "list" of documents to match and remove.
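To make that concrete, here is a hypothetical grouped document as it arrives in the loop. The RegisterNumber value and the ObjectId placeholders are invented for illustration:

// One grouped result for a duplicated RegisterNumber
{ "_id": "REG-1001", "ids": [ ObjectId("...a"), ObjectId("...b"), ObjectId("...c") ], "count": 3 }

// doc.ids.shift() keeps ObjectId("...a"), so the delete filter becomes:
{ "_id": { "$in": [ ObjectId("...b"), ObjectId("...c") ] } }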
The process is generally the same if you need something more complex, such as merging details from the other duplicate documents. You just might need more care if doing something like normalizing the case of the "unique key": in that case, actually remove the duplicates first before writing the changes to the document to be kept.
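As a rough sketch of that more complex case, the following assumes RegisterNumber should match case-insensitively and that you want to keep the distinct Name values from the discarded documents on the surviving one. The allNames field and the use of $toLower here are my own illustration, not something from the question:

var ops = [];

db.collection.aggregate([
    { "$group": {
        "_id": { "$toLower": "$RegisterNumber" },   // normalize case for grouping
        "ids": { "$push": "$_id" },
        "names": { "$addToSet": "$Name" },          // collect the details worth merging
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
]).forEach( doc => {
    var keep = doc.ids.shift();

    // Remove the duplicates first, then write the merged details to the kept document.
    // bulkWrite is ordered by default, so these run in sequence.
    ops.push({ "deleteMany": { "filter": { "_id": { "$in": doc.ids } } } });
    ops.push({ "updateOne": {
        "filter": { "_id": keep },
        "update": { "$set": { "allNames": doc.names } }
    }});

    if (ops.length >= 500) {
        db.collection.bulkWrite(ops);
        ops = [];
    }
});

if (ops.length > 0)
    db.collection.bulkWrite(ops);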
At any rate, the aggregation will highlight the documents that actually are "duplicates". The remaining processing logic is simply a matter of what you want to do with that information once you have identified them.