Question
Objective
Fix my make_pipeline() function so that, using an aggregation query, it counts the number of tweets for each user, collects the tweet texts in an array, and returns the 5 users with the most tweets.
Exercise
Using an aggregation query, count the number of tweets for each user.
In the same $group stage, use $push to accumulate all the tweet texts for each user.
Limit your output to the 5 users with the most tweets.
Your result documents should include only the fields:
- "_id" (screen name of user)
- "count" (number of tweets found for the user)
- "tweet_texts" (a list of the tweet texts found for the user)
Background
To achieve the previous objective I am testing the following code:
def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$group": {"_id": "$user.screen_name", "tweet_texts": {"$push": "$text"}, "count": {"$sum": 1}}},
        {"$project": {"_id": "$user.screen_name", "count": 1, "tweet_texts": 1}},
        {"$sort": {"count": -1}},
        {"$limit": 5}
    ]
    return pipeline
Logic
First I group all the tweets by username. Then, in the same stage, I push each tweet's text to tweet_texts and count each occurrence that was grouped. I believe this gives me the number of tweets for each user.
Then I make a projection to select only the three fields I want:
- _id
- count
- tweet_texts
I finish by sorting and limiting the amount of results.
Problem
I am passing the test, but not the submission. What am I doing wrong? I know the error must be in the first (group) stage, but for the love of God I can't find what I am doing wrong.
Data Sample
{
"_id" : ObjectId("5304e2e3cc9e684aa98bef97"),
"text" : "First week of school is over :P",
"in_reply_to_status_id" : null,
"retweet_count" : null,
"contributors" : null,
"created_at" : "Thu Sep 02 18:11:25 +0000 2010",
"geo" : null,
"source" : "web",
"coordinates" : null,
"in_reply_to_screen_name" : null,
"truncated" : false,
"entities" : {
"user_mentions" : [ ],
"urls" : [ ],
"hashtags" : [ ]
},
"retweeted" : false,
"place" : null,
"user" : {
"friends_count" : 145,
"profile_sidebar_fill_color" : "E5507E",
"location" : "Ireland :)",
"verified" : false,
"follow_request_sent" : null,
"favourites_count" : 1,
"profile_sidebar_border_color" : "CC3366",
"profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg",
"geo_enabled" : false,
"created_at" : "Sun May 03 19:51:04 +0000 2009",
"description" : "",
"time_zone" : null,
"url" : null,
"screen_name" : "Catherinemull",
"notifications" : null,
"profile_background_color" : "FF6699",
"listed_count" : 77,
"lang" : "en",
"profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg",
"statuses_count" : 2475,
"following" : null,
"profile_text_color" : "362720",
"protected" : false,
"show_all_inline_media" : false,
"profile_background_tile" : true,
"name" : "Catherine Mullane",
"contributors_enabled" : false,
"profile_link_color" : "B40B43",
"followers_count" : 169,
"id" : 37486277,
"profile_use_background_image" : true,
"utc_offset" : null
},
"favorited" : false,
"in_reply_to_user_id" : null,
"id" : NumberLong("22819398300")
}
Please help!
Answer 1:
The $project step is redundant: the $group stage already produces just those three fields, so there is no need for a subsequent $project stage.
The correct pipeline should be
pipeline = [
    {
        "$group": {
            "_id": "$user.screen_name",
            "tweet_texts": { "$push": "$text" },
            "count": { "$sum": 1 }
        }
    },
    { "$sort": { "count": -1 } },
    { "$limit": 5 }
]
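To make the effect of these three stages concrete, here is a rough plain-Python sketch that mimics them on a handful of made-up tweets. The function name emulate_pipeline and the sample data are hypothetical, and this only approximates MongoDB's behaviour (for example, how ties in count are ordered is not modelled).

```python
from collections import defaultdict

def emulate_pipeline(tweets, limit=5):
    """Plain-Python stand-in for $group -> $sort -> $limit (no MongoDB involved)."""
    grouped = defaultdict(lambda: {"tweet_texts": [], "count": 0})
    for tweet in tweets:
        name = tweet["user"]["screen_name"]                 # "_id": "$user.screen_name"
        grouped[name]["tweet_texts"].append(tweet["text"])  # "$push": "$text"
        grouped[name]["count"] += 1                         # "$sum": 1
    docs = [{"_id": name, **fields} for name, fields in grouped.items()]
    docs.sort(key=lambda d: d["count"], reverse=True)       # {"$sort": {"count": -1}}
    return docs[:limit]                                     # {"$limit": 5}

# Made-up sample data, shaped like the tweet document in the question.
sample_tweets = [
    {"text": "a", "user": {"screen_name": "alice"}},
    {"text": "b", "user": {"screen_name": "bob"}},
    {"text": "c", "user": {"screen_name": "alice"}},
]
result = emulate_pipeline(sample_tweets)
```

Each document in result carries exactly the keys "_id", "tweet_texts" and "count", which is why no extra projection is needed afterwards.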
Your $project stage didn't work because the documents coming out of the previous $group stage have no field "$user.screen_name", which you attempt to use as the _id field in the $project stage. After $group, the screen name lives in _id.
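The point can be illustrated with a small plain-Python sketch of dotted field-path lookup. The helper resolve_path is hypothetical and deliberately simplified: it returns None for a missing path, whereas real MongoDB has its own rules for missing fields that this sketch does not try to reproduce.

```python
def resolve_path(doc, path):
    """Follow a dotted field path such as "$user.screen_name"; None if absent."""
    current = doc
    for key in path.lstrip("$").split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

# A raw tweet (input to $group) and a grouped document (output of $group).
raw_tweet = {"text": "hi", "user": {"screen_name": "Catherinemull"}}
grouped_doc = {"_id": "Catherinemull", "tweet_texts": ["hi"], "count": 1}

before_group = resolve_path(raw_tweet, "$user.screen_name")    # found on the raw tweet
after_group = resolve_path(grouped_doc, "$user.screen_name")   # no "user" key any more
via_id = resolve_path(grouped_doc, "$_id")                     # screen name now lives here
```

In this model the path only resolves before grouping; afterwards the same value is reachable as "$_id".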
However, if you wanted to include the $project step then the working pipeline should follow:
pipeline = [
    {
        "$group": {
            "_id": "$user.screen_name",
            "tweet_texts": { "$push": "$text" },
            "count": { "$sum": 1 }
        }
    },
    { "$project": { "count": 1, "tweet_texts": 1 } },
    { "$sort": { "count": -1 } },
    { "$limit": 5 }
]
Answer 2:
Reading comments
Reading the comments I found out that
pipeline = [
    {"$group": {"_id": "$user.screen_name", "tweet_texts": {"$push": "$text"}, "count": {"$sum": 1}}},
    {"$project": {"_id": "$user.screen_name", "count": 1, "tweet_texts": 1}},
    {"$sort": {"count": -1}},
    {"$limit": 5}
]
Should in fact be changed to:
pipeline = [
    {"$group": {"_id": "$user.screen_name", "tweet_texts": {"$push": "$text"}, "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5}
]
Why?
The full answer and explanation can be seen in the answer:
- What is the difference between these two MongoDB queries?
The conclusion of the story is that I was using the $project stage wrongly. Not only was it not needed in the first place, but to make it idempotent it should be
{"$project": {"_id": "$_id", "count": 1, "tweet_texts": 1}},
I also highly recommend his answer:
- https://stackoverflow.com/a/40970901/1337392
Special Thanks
The following users deserve kudos++:
- https://stackoverflow.com/users/122005/chridam
For pointing me down the right path!
Source: https://stackoverflow.com/questions/40868553/using-push-with-group-with-pymongo