Question
I have two fields, 'company' and 'url'. I want to sort by the number of times each distinct 'company' occurs and then display three 'url' values for each of those companies. The data is stored like this:
{
"_id" : ObjectId("56c4f73664af6f7305f3670f"),
"title" : "Full Stack Software Developer",
"url" : "http://www.indeed.com/cmp/Upside-Commerce,-Inc./jobs/Full-Stack-Software-Developer-6e93e36ea5d0e57e?sjdu=QwrRXKrqZ3CNX5W-O9jEvRQls7y2xdBHzhqWkvhd5FFfs8wS9wesfMWXjNNFaUXen2pO-kyc_Qbr7-_3Gf40AvyEQT3jn6IRxIwvw9-aFy8",
"company" : "Upside Commerce, Inc."
}
The following query counts the number of documents for each distinct company.
db.Books.aggregate({$group : { _id : '$company', count : {$sum : 1}}})
Following is the output:
{ "_id" : "Microsoft", "count" : 14 }
{ "_id" : "Tableau", "count" : 64 }
{ "_id" : "Amazon", "count" : 64 }
{ "_id" : "Dropbox", "count" : 64 }
{ "_id" : "Amazon Corporate LLC", "count" : 64 }
{ "_id" : "Electronic Arts", "count" : 64 }
{ "_id" : "CDK Global", "count" : 65 }
{ "_id" : "IDC Technologies", "count" : 64 }
{ "_id" : "Concur", "count" : 64 }
However, I want to sort by the count for each distinct company (limited to the top 10 most frequent companies) and then display three urls for each of those companies (if the count for that company is at least three). Something like:
{for microsoft:
{"url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=1071484607&utm_source=Indeed"}
{"url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=1695844082&utm_source=Indeed" }
{ "url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=932148152&utm_source=Indeed"}}
The same goes for the other companies.
Answer 1:
This is still best handled by multiple queries, since MongoDB does not yet have efficient operators for doing it in a single statement.
You can do something like this with MongoDB 3.2, but there are obvious "catches":
db.Books.aggregate([
  { "$group": {
    "_id": "$company",
    "count": { "$sum": 1 },
    "urls": {
      "$push": "$url"
    }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 10 },
  { "$project": {
    "count": 1,
    "urls": { "$slice": ["$urls",0, 3] }
  }}
])
And the obvious problem is that no matter what, you are still adding all of the "url" content into the grouped array. This has the potential to exceed the BSON limit of 16MB. It might not, but it's still a bit wasteful to add "all" content when you only want "three" of them.
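To make that cost concrete, here is a minimal in-memory sketch in plain JavaScript that imitates the stages of the pipeline above. It is not MongoDB code: the sample documents, the `topCompanies` function and the `urlsHeldBeforeSlice` field are all hypothetical names introduced for illustration. It only shows that every group holds its full list of urls until the final trimming step.

```javascript
// Hypothetical sample data; real documents would also carry _id and title.
var sampleDocs = [
  { company: "Acme", url: "http://example.com/acme/1" },
  { company: "Acme", url: "http://example.com/acme/2" },
  { company: "Acme", url: "http://example.com/acme/3" },
  { company: "Acme", url: "http://example.com/acme/4" },
  { company: "Globex", url: "http://example.com/globex/1" },
  { company: "Globex", url: "http://example.com/globex/2" },
  { company: "Initech", url: "http://example.com/initech/1" }
];

// In-memory imitation of $group/$push, $sort, $limit and $slice.
function topCompanies(docs, maxCompanies, maxUrls) {
  var groups = new Map();
  docs.forEach(function (doc) {
    var group = groups.get(doc.company);
    if (!group) {
      group = { _id: doc.company, count: 0, urls: [] };
      groups.set(doc.company, group);
    }
    group.count += 1;
    group.urls.push(doc.url); // like $push: every url is kept at this point
  });

  return Array.from(groups.values())
    .sort(function (a, b) { return b.count - a.count; }) // like $sort: { count: -1 }
    .slice(0, maxCompanies)                              // like $limit
    .map(function (group) {
      return {
        _id: group._id,
        count: group.count,
        urlsHeldBeforeSlice: group.urls.length, // size of the array before trimming
        urls: group.urls.slice(0, maxUrls)      // like $slice: ["$urls", 0, 3]
      };
    });
}

var result = topCompanies(sampleDocs, 2, 3);
console.log(JSON.stringify(result, null, 2));
```

With this sample data the "Acme" group holds four urls before being trimmed to three. With 64 or more documents per company, as in the question, the same pattern holds many more urls than are ever returned.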
So even then it's probably more practical to query for the "urls" separately for each of the top 10 results.
Here's a listing for node.js that demonstrates:
var async = require('async'),
    mongodb = require('mongodb'),
    MongoClient = mongodb.MongoClient;

MongoClient.connect("mongodb://localhost/test",function(err,db) {
  if (err) throw err;

  // Get the top 10 companies by document count
  db.collection("Books").aggregate(
    [
      { "$group": {
        "_id": "$company",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1 } },
      { "$limit": 10 }
    ],function(err,results) {
      if (err) throw err;

      // Query for each result and map the query response as urls
      async.map(
        results,
        function(result,callback) {
          // $group puts the company name in _id, not in "company"
          db.collection("Books").find({
            "company": result._id
          }).limit(3).toArray(function(err,items) {
            if (err) return callback(err);
            result.urls = items.map(function(item) {
              return item.url;
            });
            callback(null,result);
          })
        },
        function(err,results) {
          if (err) throw err;
          // each result entry now has up to 3 urls
          db.close();
        }
      );
    }
  )
});
Yes, it's more calls to the database, but it is only ten of them, so it's not really an issue.
The real resolution for this is covered in SERVER-9377 - Extend $push or $max to allow collecting "top" N values per _id key in $group phase. This has the promising "In Progress" status, so it is actively being worked on.
Once that is resolved, a single aggregation statement becomes viable, since you would then be able to "limit" the resulting "urls" in the initial $push to just three entries, rather than remove all but three after the fact.
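As a hedged sketch of what such a single statement can look like: this is not part of the original answer, and it assumes a server running MongoDB 5.2 or later, where the `$firstN` accumulator is available in `$group`. The variable name `pipeline` is only for illustration. Since there is no preceding `$sort` on the documents, which three urls are kept for each company is not defined.

```javascript
// Assumption: MongoDB 5.2+ server, where $firstN can be used as a $group accumulator.
var pipeline = [
  { "$group": {
    "_id": "$company",
    "count": { "$sum": 1 },
    // Keep at most three urls per company while grouping,
    // instead of pushing all of them and trimming afterwards.
    "urls": { "$firstN": { "input": "$url", "n": 3 } }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 10 }
];

// In the shell this would be run as: db.Books.aggregate(pipeline)
```

The grouped array then never grows past three entries, so the concern about the 16MB BSON limit no longer applies to it.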
Source: https://stackoverflow.com/questions/35929411/mongo-query-to-sort-by-distinct-count