Question
I have a database whose entries correspond to work contracts. In MongoDB I have aggregated them by worker. In a simplified version, the original data looks something like this:
{
"_id" : ObjectId("5ea995662a40c63b14266071"),
"worker" : "1070",
"employer" : "2116096",
"start" : ISODate("2018-01-11T01:00:00.000+01:00"),
"ord_id" : 0
},
{
"_id" : ObjectId("5ea995662a40c63b14266071"),
"worker" : "1070",
"employer" : "2116096",
"start" : ISODate("2018-01-11T01:00:00.000+01:00"),
"ord_id" : 1
},
{
"_id" : ObjectId("5ea995662a40c63b14266072"),
"worker" : "1071",
"employer" : "2116055",
"start" : ISODate("2019-01-03T01:00:00.000+01:00"),
"ord_id" : 2
},
{
"_id" : ObjectId("5ea995662a40c63b14266072"),
"worker" : "1071",
"employer" : "2116056",
"start" : ISODate("2019-01-03T01:00:00.000+01:00"),
"ord_id" : 3
},
I have rearranged the data by worker:
{
"_id" : ObjectId("5ea995662a40c63b14266071"),
"worker" : "1070",
"contratcs" : [
{
"employer" : "2116096",
"start" : ISODate("2018-01-11T01:00:00.000+01:00"),
"ord_id" : 0
},
{
"employer" : "2116096",
"start" : ISODate("2018-01-11T01:00:00.000+01:00"),
"ord_id" : 1
} // Since employer identification and starting date is the same of the previous, this is a duplicate!
]
},
{
"_id" : ObjectId("5ea995662a40c63b14266072"),
"worker" : "1701",
"contratcs" : [
{
"employer" : "2116055",
"start" : ISODate("2019-01-03T01:00:00.000+01:00"),
"ord_id" : 2
},
{
"employer" : "2116056",
"start" : ISODate("2019-01-04T01:00:00.000+01:00"),
"ord_id" : 3
}
]
}
In the original table some contracts were recorded twice, so I have to keep only one of each. More specifically (in the example), I consider as duplicates those contracts for the same worker that started on the same day with the same employer. However, which duplicate to keep is not arbitrary (and it is not up to me). There is a numeric field named 'ord_id' (which I generated when loading the data into MongoDB) that is unique, so among duplicates it is the only value that actually differs. Among duplicates, I have to keep the one with the highest value of 'ord_id'. Following this thread, I wrote:
db.mycollection.aggregate([
{ $unwind: "$contracts" },
{ $group: {
_id: { WORKER: "$worker", START: "$contracts.start" },
dups: { $addToSet: "$_id" },
ord_id: { $addToSet: "$contracts.ord_id" },
count: {$sum: 1 }
}
},
{ $match: { count: { $gt: 1} } },
{ $sort: {count: -1, ord_id: -1 } }
],{allowDiskUse: true}).
forEach(function(doc) {
doc.dups.shift();
db.mycollection.remove({_id : {$in: doc.dups }});
});
Besides the problems I am facing with the removal when I aggregate by contracts, I would like the duplicate that is shifted out (and therefore preserved) to be the one with the highest value of 'ord_id'. I am still new to MongoDB and still switching mentally from a mostly relational (SQL) approach. Apologies for the silly question.
Answer 1:
This aggregation returns the desired result: it eliminates duplicate contracts based on worker + employer + start, and preserves only the contract with the highest ord_id among the duplicates.
db.collection.aggregate( [
{
$unwind: "$contracts"
},
{
$group: {
_id: { worker: "$worker", employer: "$contracts.employer", start: "$contracts.start" },
max_ord: { $max: "$contracts.ord_id" },
doc: { $first: "$$ROOT" }
}
},
{
$group: {
_id: { _id: "$doc._id", worker: "$doc.worker" },
contracts: { $push: { employer: "$_id.employer", start: "$_id.start", ord_id: "$max_ord" } }
}
},
{
$addFields: {
_id: "$_id._id",
worker: "$_id.worker"
}
}
] )
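The rule this pipeline applies can also be checked outside MongoDB. The following is a minimal plain-JavaScript sketch, not part of the original answer: `dedupeContracts` is a hypothetical helper name, and the sample document is taken from the question. It assumes `start` values are `Date` objects (or strings that `Date` can parse).

```javascript
// Plain JavaScript sketch (no MongoDB needed) of the rule the pipeline applies:
// within one worker document, keep a single contract per (employer, start) pair,
// choosing the one with the highest ord_id.
// dedupeContracts is a hypothetical helper name, not a MongoDB API.
function dedupeContracts(workerDoc) {
  const best = new Map();
  for (const c of workerDoc.contracts) {
    // Dates are compared by their millisecond value so equal instants share a key.
    const key = c.employer + "|" + new Date(c.start).getTime();
    const kept = best.get(key);
    if (kept === undefined || c.ord_id > kept.ord_id) {
      best.set(key, c);
    }
  }
  return { ...workerDoc, contracts: [...best.values()] };
}

// Sample taken from the question: two contracts that differ only in ord_id.
const sample = {
  worker: "1070",
  contracts: [
    { employer: "2116096", start: new Date("2018-01-11T01:00:00.000+01:00"), ord_id: 0 },
    { employer: "2116096", start: new Date("2018-01-11T01:00:00.000+01:00"), ord_id: 1 },
  ],
};

const cleaned = dedupeContracts(sample);
console.log(cleaned.contracts.map((c) => c.ord_id)); // logs [ 1 ]
```

Contracts with a different employer or a different start date get different keys, so they are all kept.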
Answer 2:
If you reverse sort by ord_id, you can use $first in the $group stage to select the highest value. This example returns the entire document in doc, as well as the count of duplicates:
db.mycollection.aggregate([
{ $unwind: "$contracts" },
{ $sort: {"contracts.ord_id":-1}},
{ $group: {
_id: { WORKER: "$worker", START: "$contracts.start", EMPLOYER: "$contracts.employer" },
doc: { $first: "$$ROOT" },
count: {$sum: 1 }
}}
],{allowDiskUse: true})
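If the goal is to get back one document per worker rather than one per contract, a second $group stage could be appended. The sketch below is an assumption, not part of the original answer. It builds the stage list as a plain array, uses the plain field path `contracts.ord_id` as the sort key, and relies on `doc.contracts` being a single subdocument after $unwind. The order of the pushed contracts is not guaranteed.

```javascript
// Sketch only: the stage list is built as a plain array so it can be inspected
// before being passed to db.mycollection.aggregate(pipeline, { allowDiskUse: true }).
const pipeline = [
  { $unwind: "$contracts" },
  // Highest ord_id first, so $first below picks it within each duplicate group.
  { $sort: { "contracts.ord_id": -1 } },
  { $group: {
      _id: { WORKER: "$worker", START: "$contracts.start", EMPLOYER: "$contracts.employer" },
      doc: { $first: "$$ROOT" }
  } },
  // Assumed extra stage: collect the surviving contracts back under each worker document.
  { $group: {
      _id: "$doc._id",
      worker: { $first: "$doc.worker" },
      contracts: { $push: "$doc.contracts" }
  } }
];
```

In the mongo shell this would be run as `db.mycollection.aggregate(pipeline, { allowDiskUse: true })`. It only reads the collection, so the result can be inspected before anything is overwritten.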
Source: https://stackoverflow.com/questions/61508338/eliminate-duplicates-in-mongodb-with-a-specific-sort