I've been learning Python for a couple of months through online courses and would like to further my learning through a real-world mini project.
For this project, I just insert the raw JSON into the database. It seems a bit ugly and hacky, but it does work. A notable problem is that the creation dates of the tweets are stored as strings. The question How do I compare dates from Twitter data stored in MongoDB via PyMongo? provides a way to fix that (I inserted a comment in the code to indicate where one would perform that task).
import pymongo
import tweepy

# ...
client = pymongo.MongoClient()
db = client.twitter_db
twitter_collection = db.tweets
# ...

class CustomStreamListener(tweepy.StreamListener):
    # ...
    def on_status(self, status):
        try:
            twitter_json = status._json
            # TODO: Transform created_at to Date objects before insertion
            tweet_id = twitter_collection.insert(twitter_json)
        except Exception:
            # Catch any errors (e.g. unicode errors while printing to
            # the console) and ignore them to avoid breaking the application.
            pass
# ...

stream = tweepy.Stream(auth, CustomStreamListener(), timeout=None, compression=True)
stream.sample()
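For the date fix flagged in the TODO above, here is a minimal sketch (assuming the standard created_at string format the Twitter API returns, e.g. 'Wed Aug 27 13:08:45 +0000 2008'; the helper name is mine):

from datetime import datetime

def to_datetime(created_at):
    # created_at arrives as e.g. 'Wed Aug 27 13:08:45 +0000 2008'
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')

# inside on_status, before the insert:
# twitter_json['created_at'] = to_datetime(twitter_json['created_at'])

PyMongo stores Python datetime objects as BSON dates, so range queries and comparisons on created_at then work as expected.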
In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here, but some may merit asking as separate questions on SO.
Why doesn't on_data work? Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/access keys; I'd try that.
There are a few things I might do differently from your answer.
tweets is a global list. This means that if you have multiple StreamListeners (e.g. in multiple threads), every tweet collected by any stream listener will be added to this list. This is because variables in Python hold references to objects, so two names can point to the same list. If that's confusing, here's a basic example of what I mean:
>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print(bar)
[7]
Notice that even though you only appended 7 to foo, bar changed as well: foo and bar refer to the same list, so changing one changes both. If you want an independent copy instead, create a new list, e.g. foo = list(bar).
If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:
class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(CustomStreamListener, self).__init__()
        self.list_of_tweets = []
This changes the tweets list to be scoped to your class instance. Also, I think it's appropriate to rename the property from self.save_file to self.list_of_tweets, because you also call the file that you're appending the tweets to save_file. Although this will not strictly cause an error, it's confusing to a human reader that self.save_file is a list while save_file is a file. It helps future you, and anyone else who reads your code, figure out what everything does. More on variable naming.
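To make the change concrete, here is a sketch of an on_data method that uses the per-instance list (the method body is my assumption, not part of the original answer; it assumes import json at the top of the module):

    def on_data(self, tweet):
        # append to this listener's own list instead of a global
        self.list_of_tweets.append(json.loads(tweet))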
In my comment, I mentioned that you shouldn't use file as a variable name. In Python 2, file is a builtin that constructs a new object of type file. You can technically overwrite it, but it is a very bad idea to do so. For more builtins, see the Python documentation.
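For illustration, here's what the shadowing looks like in a Python 2 session (the filenames are just examples):

>>> file
<type 'file'>
>>> file = open('example.txt', 'a')
>>> file('another.txt')
Traceback (most recent call last):
  ...
TypeError: 'file' object is not callable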
All keywords are OR'd together in this type of search (source: Twitter's streaming API documentation):
sapi.filter(track=['twitter', 'python', 'tweepy'])
This means that this will get tweets containing 'twitter', 'python', or 'tweepy'. If you want only tweets that match all of the terms (AND, i.e. the intersection rather than the union), you have to post-process by checking each tweet against the list of all terms you want to search for.
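A minimal sketch of that post-processing check (the function name is mine, not from the original answer):

def contains_all_terms(tweet_text, terms):
    # True only if every search term appears in the tweet text
    text = tweet_text.lower()
    return all(term.lower() in text for term in terms)

# e.g. keep a tweet only if it mentions every term:
# if contains_all_terms(tweet_dict.get('text', ''), ['twitter', 'python', 'tweepy']):
#     self.list_of_tweets.append(tweet_dict)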
I actually just realized that you did ask this as its own question, as I was about to suggest. A regex post-processing solution is a good way to accomplish this. You could also try filtering by both location and keyword, like so (note that the streaming API combines filter parameters with OR, so you may still need to post-process the results):
sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
That depends on how many tweets you'll be collecting. I'm a fan of databases, especially if you're planning to do sentiment analysis on a lot of tweets. When you collect data, you should only collect the things you will need. This means that when you save results to your database (or wherever) in your on_data method, you should extract the important parts from the JSON and not save anything else. If, for example, you want to look at brand, country, and time, take only those three things; don't save the entire JSON dump of the tweet, because it'll just take up unnecessary space.
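As a sketch of that extraction (the field names follow the standard tweet JSON; which fields you keep depends on your analysis):

import json

# inside your CustomStreamListener:
def on_data(self, data):
    tweet = json.loads(data)
    # keep only what the analysis needs
    slim = {
        'text': tweet.get('text'),
        'created_at': tweet.get('created_at'),
        'country': (tweet.get('place') or {}).get('country'),
    }
    self.list_of_tweets.append(slim)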
I found a way to save the tweets to a JSON file. Happy to hear how it can be improved!
import json
import tweepy

# initialize blank list to contain tweets
tweets = []
# open the output file in append mode
save_file = open('9may.json', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(CustomStreamListener, self).__init__()
        self.save_file = tweets

    def on_data(self, tweet):
        self.save_file.append(json.loads(tweet))
        print(tweet)
        save_file.write(str(tweet))
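One improvement, offered as a sketch (my suggestion, not from the original post): since on_data receives the tweet as a raw JSON string, writing one object per line keeps the file parseable later.

    def on_data(self, tweet):
        self.save_file.append(json.loads(tweet))
        # newline-delimited JSON: each line of the file is one parseable object
        save_file.write(tweet.strip() + '\n')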