Tracking keywords in a live stream of tweets

予麋鹿 2021-01-27 17:39

I installed and tried out tweepy. Right now I am using the following function:

from API Reference

API.public_timeline()

Returns the 20 most rec

2 answers
  • 2021-01-27 18:22

    Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets that match those words are returned.

    Rate limiting works differently for the streaming API: you get one connection per IP and a maximum number of events per second. If more events occur than that, you only get the maximum anyway, along with a notification telling you how many events you missed because of rate limiting.
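    To make the "notification" part concrete, here is a minimal sketch of separating tweets from limit notices while reading the stream. It assumes each message arrives as one JSON line and that a limit notice looks like `{"limit": {"track": N}}`, with N being a running count of undelivered tweets (that is how I remember the v1.1 streaming docs describing it, so treat the exact shape as an assumption). The function name `split_stream_messages` and the sample data are made up for illustration.

```python
import json


def split_stream_messages(lines):
    """Separate tweets from limit notices in an iterable of raw JSON lines.

    Returns (tweets, missed), where `missed` is the largest undelivered
    count reported by any limit notice seen so far.
    """
    tweets = []
    missed = 0
    for line in lines:
        line = line.strip()
        if not line:
            # Blank lines are treated as keep-alives and skipped.
            continue
        message = json.loads(line)
        if "limit" in message:
            # Assumed format: {"limit": {"track": <count>}}, cumulative.
            missed = max(missed, message["limit"].get("track", 0))
        elif "text" in message:
            # Only messages carrying text are treated as tweets.
            tweets.append(message)
        # Anything else (e.g. delete notices) is ignored in this sketch.
    return tweets, missed


# Hand-written sample messages, not real API output.
sample = [
    '{"id": 1, "text": "python streaming"}',
    '',
    '{"limit": {"track": 5}}',
    '{"id": 2, "text": "more python"}',
    '{"limit": {"track": 12}}',
    '{"delete": {"status": {"id": 1}}}',
]
tweets, missed = split_stream_messages(sample)
print(len(tweets), missed)
```

    Tracking `missed` this way at least tells you how lossy your keyword filter is, which helps when deciding whether to narrow the list of tracked words.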

    My understanding is that the streaming API is best suited to servers that redistribute the content to your users as needed, rather than being accessed directly by your users. The standing connections are expensive, and Twitter starts blacklisting IPs after too many failed connections and reconnections, and possibly your API key after that.
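    Because repeated failed reconnections are what get you blacklisted, it is worth spacing out retries. Below is a minimal sketch of exponential backoff around a connection attempt. Everything here is illustrative: `connect_with_backoff`, its parameters, and `flaky_connect` are hypothetical names, the delays are not Twitter's recommended values, and the real connect call would come from whatever streaming library you use.

```python
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call `connect()` until it succeeds, doubling the wait after each failure.

    `connect` is any zero-argument callable that raises OSError (which
    includes ConnectionError) on failure. `sleep` is injectable so the
    behaviour can be checked without actually waiting.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts:
                # Give up rather than hammering the server forever.
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Simulated connection that fails twice, then succeeds.
attempts = {"n": 0}


def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated failure")
    return "connected"


waits = []
result = connect_with_backoff(flaky_connect, sleep=waits.append)
print(result, waits)
```

    Capping the number of attempts matters as much as the delay: a client that retries forever looks the same to the server as one that is misbehaving.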

  • 2021-01-27 18:30

    The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:

    import time

    import tweetstream

    # `username`, `password`, and `db_initpop` are assumed to be defined elsewhere.


    def retrieve_tweets(numtweets=10, *args):
        """
        This function optionally takes one or more arguments as keywords to filter tweets.
        It iterates through tweets from the stream that meet the given criteria and sends them
        to the database population function on a per-instance basis, so as to avoid disaster
        if the stream is disconnected.

        Both SampleStream and FilterStream methods access Twitter's stream of status elements.
        For status element documentation (including proper arguments for tweet['arg'] as seen
        below), see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
        """
        # Any extra positional arguments are treated as keywords to track.
        filters = [str(key) for key in args]
        if not filters:
            stream = tweetstream.SampleStream(username, password)
        else:
            stream = tweetstream.FilterStream(username, password, track=filters)
        try:
            count = 0
            while count < numtweets:
                # Read from the stream until one usable tweet has been stored, then
                # break out so the counter advances by one per stored tweet.
                for tweet in stream:
                    # a check is needed on text as some "tweets" are actually just API operations
                    # the language selection doesn't really work but it's better than nothing(?)
                    if tweet.get('text') and tweet['user']['lang'] == 'en':
                        if tweet['retweet_count'] == 0:
                            # bundle up the features I want and send them to the db population function
                            bundle = (tweet['id'], tweet['user']['screen_name'],
                                      tweet['retweet_count'], tweet['text'])
                            db_initpop(bundle)
                            break
                        else:
                            # a RT has a different structure.  This bundles the original tweet.  Getting the
                            # retweets comes later, after the stream is de-accessed.
                            original = tweet['retweeted_status']
                            bundle = (original['id'], original['user']['screen_name'],
                                      tweet['retweet_count'], original['text'])
                            db_initpop(bundle)
                            break
                count += 1
        except tweetstream.ConnectionError as e:
            # `except ... as e` and print() work on Python 2.6+ as well as Python 3.
            timestamp = time.strftime("%d %b %Y %H:%M:%S", time.localtime())
            print('Disconnected from Twitter at %s.  Reason: %s' % (timestamp, e.reason))
    

    I haven't looked in a while, but I'm pretty sure that this library is just accessing the sample stream (as opposed to the firehose). HTH.

    Edit to add: you say you want the "complete live stream", aka the firehose. That's fiscally and technically expensive and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.
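    The listening function above hands each bundle to `db_initpop`, which isn't shown. For anyone trying to run it, here is one way it could look. This is only a sketch under my own assumptions: it uses SQLite via the standard `sqlite3` module, an in-memory database, and a made-up `tweets` table whose columns mirror the four-element bundle. It is not the original function.

```python
import sqlite3

# In-memory database for the sketch; a real script would pass a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets ("
    "id INTEGER PRIMARY KEY, screen_name TEXT, retweet_count INTEGER, text TEXT)"
)


def db_initpop(bundle, connection=conn):
    """Store one (id, screen_name, retweet_count, text) bundle.

    Each call commits on its own, so tweets already stored survive if the
    stream disconnects mid-run. Duplicate ids are silently skipped.
    """
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "INSERT OR IGNORE INTO tweets (id, screen_name, retweet_count, text) "
            "VALUES (?, ?, ?, ?)",
            bundle,
        )


# Made-up sample data; the second call is a duplicate and is ignored.
db_initpop((1, "alice", 0, "hello stream"))
db_initpop((1, "alice", 0, "hello stream"))
rows = conn.execute(
    "SELECT id, screen_name, retweet_count, text FROM tweets"
).fetchall()
print(rows)
```

    Committing per tweet is slower than batching, but it matches the "per-instance" intent described in the docstring: a dropped connection costs you at most the tweet in flight.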
