I have a collection full of documents with a created_date attribute. I\'d like to send these documents through an aggregation pipeline to do some work on them. Ideally I would
Try this;
db.createCollection("so");
db.so.remove();
db.so.insert([
{
post_body: 'This is the body of test post 1',
created_date: ISODate('2012-09-29T05:23:41Z'),
comments: 48
},
{
post_body: 'This is the body of test post 2',
created_date: ISODate('2012-09-24T12:34:13Z'),
comments: 10
},
{
post_body: 'This is the body of test post 3',
created_date: ISODate('2012-08-16T12:34:13Z'),
comments: 10
}
]);
//db.so.find();
db.so.ensureIndex({"created_date":1});
db.runCommand({
aggregate:"so",
pipeline:[
{
$match: { // filter only those posts in september
created_date: { $gte: ISODate('2012-09-01'), $lt: ISODate('2012-10-01') }
}
},
{
$group: {
_id: null, // no shared key
comments: { $sum: "$comments" } // total comments for all the posts in the pipeline
}
},
]
//,explain:true
});
Result is;
{ "result" : [ { "_id" : null, "comments" : 58 } ], "ok" : 1 }
So you could also modify your previous example to do this, although I'm not sure why you'd want to, unless you plan on doing something else with month and year in the pipeline;
{
aggregate: 'posts',
pipeline: [
{$match: { created_date: { $gte: ISODate('2012-09-01'), $lt: ISODate('2012-10-01') } } },
{$project:
{
month : {$month:'$created_date'},
year : {$year:'$created_date'}
}
},
{$match:
{
month:9,
year: 2012
}
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}
As you already found, you cannot $match on fields that are not in the document (it works exactly the same way that find works) and if you use $project first then you will lose the ability to use indexes.
What you can do instead is combine your efforts as follows:
{
aggregate: 'posts',
pipeline: [
{$match: {
created_date :
{$gte:{$date:'2012-09-01T04:00:00Z'},
$lt: {date:'2012-10-01T04:00:00Z'}
}}
}
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}
The above only gives you aggregation for September, if you wanted to aggregate for multiple months, you can for example:
{
aggregate: 'posts',
pipeline: [
{$match: {
created_date :
{ $gte:'2012-07-01T04:00:00Z',
$lt: '2012-10-01T04:00:00Z'
}
},
{$project: {
comments: 1,
new_created: {
"yr" : {"$year" : "$created_date"},
"mo" : {"$month" : "$created_date"}
}
}
},
{$group:
{_id: "$new_created",
totalComments:{$sum:'$comments'}
}
}
]
}
and you'll get back something like:
{
"result" : [
{
"_id" : {
"yr" : 2012,
"mo" : 7
},
"totalComments" : 5
},
{
"_id" : {
"yr" : 2012,
"mo" : 8
},
"totalComments" : 19
},
{
"_id" : {
"yr" : 2012,
"mo" : 9
},
"totalComments" : 21
}
],
"ok" : 1
}
Let's look at building some pipelines that involve operations that are already familiar to us. So, we're going to look at the following stages:
match
- this is filtering stage, similar to find
.project
sort
skip
limit
We might ask ourself why these stages are necessary, given that this functionality is already provided in the MongoDB
query language, and the reason is because we need these stages to support the more complex analytics-oriented functionality that's included with the aggregation framework. The below query is simply equal to a find
:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, ])
Let's introduce a project stage in this aggregation pipeline:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$project: {
_id: 0,
name: 1,
founded_year: 1
}
}])
We use aggregate
method for implementing aggregation framework. The aggregation pipelines are merely an array of documents. Each of the document should stipulate a particular stage operator. So, in the above case we've an aggregation pipeline with two stages. The $match
stage is passing the documents one at a time to $project
stage.
Let's extend to limit
stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}])
This gets the matching documents and limits to five before projecting out the fields. So, projection is working only on 5 documents. Assume, if we were to do something like this:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$project: {
_id: 0,
name: 1
}
}, {
$limit: 5
}])
This gets the matching documents and projects those large number of documents and finally limits to five. So, projection is working on large number of documents and finally limiting to 5. This gives us a lesson that we should limit the documents to those which are absolutely necessary to be passed to the next stage. Now, let's look at sort
stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$sort: {
name: 1
}
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}])
This will sort all documents by name and give only 5 out of them. Assume, if we were to do something like this:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$limit: 5
}, {
$sort: {
name: 1
}
}, {
$project: {
_id: 0,
name: 1
}
}])
This will take first 5 documents and sort them. Let's add the skip
stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$sort: {
name: 1
}
}, {
$skip: 10
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}, ])
This will sort all the documents and skip the initial 10 documents and return to us. We should try to include $match
stages as early as possible in the pipeline. To filter documents using a $match
stage, we use the same syntax for constructing query documents (filters) as we do for find()
.