I installed and tried out tweepy, and I am using the following function right now, from the API Reference:

API.public_timeline()
    Returns the 20 most recent statuses from the public timeline.
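For context, the full call looks roughly like this (a sketch against the old tweepy REST interface from when public_timeline() still existed; back then the public timeline could be polled without authentication):

import tweepy

api = tweepy.API()  # the old public timeline endpoint did not require auth
for status in api.public_timeline():
    print('%s: %s' % (status.author.screen_name, status.text))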
Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets that match those words are returned.
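Since you are already on tweepy, here is a minimal sketch of keyword tracking with the pre-4.0 tweepy streaming interface (newer tweepy versions replaced StreamListener); the credentials and keywords below are placeholders:

import tweepy

class KeywordListener(tweepy.StreamListener):
    """Prints tweets that match the tracked keywords."""

    def on_status(self, status):
        print('%s: %s' % (status.user.screen_name, status.text))

    def on_error(self, status_code):
        # Returning False disconnects the stream; 420 means Twitter thinks
        # we are reconnecting too often.
        if status_code == 420:
            return False

consumer_key = 'YOUR-CONSUMER-KEY'          # placeholders -- use your app's keys
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_token_secret = 'YOUR-ACCESS-TOKEN-SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

stream = tweepy.Stream(auth=auth, listener=KeywordListener())
stream.filter(track=['python', 'tweepy'])  # only matching tweets are delivered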
The streaming API's rate limiting works differently: you get one connection per IP and a maximum number of events per second. If more events occur than that, you only receive the maximum anyway, along with a notification of how many events you missed because of rate limiting.
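Those "missed events" notifications arrive in the stream itself as limit notices of the form {"limit": {"track": N}}. A rough, library-agnostic sketch of separating them from real tweets (the function name and counters dict are mine, for illustration only):

def handle_stream_message(message, counters):
    """Split raw stream messages into tweets and limit notices.

    `message` is one decoded JSON object from the stream; `counters` is a
    plain dict used here only to keep the example self-contained.
    """
    if 'limit' in message:
        # Twitter reports how many matching tweets it withheld (a running total).
        counters['missed'] = message['limit'].get('track', 0)
        return None
    if 'text' in message:
        counters['received'] = counters.get('received', 0) + 1
        return message
    return None  # other control messages (deletes, disconnect notices, etc.)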
My understanding is that the streaming API is best suited for servers that redistribute the content to your users as needed, rather than being accessed directly by your users. The standing connections are expensive, and Twitter starts blacklisting IPs after too many failed connections and reconnections, and possibly your API key after that.
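If you do run a standing connection yourself, the usual defensive pattern is to back off exponentially between reconnect attempts instead of hammering the endpoint. A rough sketch, where connect_to_stream is a placeholder for whatever client call blocks while the stream is healthy:

import time

def run_with_backoff(connect_to_stream, max_wait=320):
    """Keep a streaming connection alive, backing off on repeated failures."""
    wait = 5
    while True:
        try:
            connect_to_stream()
            wait = 5  # connection was healthy; reset the backoff
        except Exception as exc:
            print('Stream dropped (%s); retrying in %d seconds' % (exc, wait))
            time.sleep(wait)
            wait = min(wait * 2, max_wait)  # exponential backoff, capped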
The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:
import time

import tweetstream

# username, password, and db_initpop() are assumed to be defined elsewhere
# in the module.

def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends
    them to the database population function one at a time, so that nothing is lost
    if the stream is disconnected.

    Both the SampleStream and FilterStream methods access Twitter's stream of status
    elements. For status element documentation (including proper arguments for
    tweet['arg'] as seen below) see
    https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    filters = [str(key) for key in args]
    if not filters:
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            for tweet in stream:
                # A check on 'text' is needed because some stream items are control
                # messages (deletes, limit notices) rather than statuses. The language
                # filter is approximate, but it's better than nothing.
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    if 'retweeted_status' not in tweet:
                        # Bundle up the features I want and send them to the db
                        # population function.
                        bundle = (tweet['id'], tweet['user']['screen_name'],
                                  tweet['retweet_count'], tweet['text'])
                    else:
                        # A retweet has a different structure, so bundle the original
                        # tweet instead. Collecting the retweets themselves comes
                        # later, after the stream is released.
                        bundle = (tweet['retweeted_status']['id'],
                                  tweet['retweeted_status']['user']['screen_name'],
                                  tweet['retweet_count'],
                                  tweet['retweeted_status']['text'])
                    db_initpop(bundle)
                    break
            count += 1
    except tweetstream.ConnectionError as e:
        print('Disconnected from Twitter at %s. Reason: %s'
              % (time.strftime("%d %b %Y %H:%M:%S", time.localtime()), e.reason))
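Calling it is then just a matter of picking a count and optional keywords (the keywords below are hypothetical, and username, password, and db_initpop() have to exist at module level as noted above):

username = 'your_twitter_username'  # placeholder credentials
password = 'your_twitter_password'

retrieve_tweets(25)                      # 25 tweets from the sample stream
retrieve_tweets(25, 'python', 'django')  # 25 tweets matching either keyword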
I haven't looked in a while, but I'm pretty sure that this library is just accessing the sample stream (as opposed to the firehose). HTH.
Edit to add: you say you want the "complete live stream", aka the firehose. That's fiscally and technically expensive and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.