Filtering of tweets received from statuses/filter (streaming API)

Question

I am tracking N different keywords (for the sake of simplicity, let N=3). So in GET statuses/filter, I pass 3 keywords in the "track" argument.

The tweets I receive can match ANY of the 3 keywords. The problem is that I want to work out which tweet corresponds to which keyword, i.e. a mapping between each tweet and the keyword(s) from the "track" argument that it matched.

Apparently, there is no way to do this without doing some processing on the tweets received.

So I was wondering: what is the best way to do this processing? Search for the keywords in the text of the tweet? What about case-insensitivity? What about a keyword made of multiple words, e.g. "Katrina Kaif"?

I am currently trying to formulate some regular expression...

I was thinking the BEST way would be to use the same logic (regular expressions etc.) that the statuses/filter API itself uses. How can I find out what logic the Twitter statuses/filter API uses to match tweets to keywords?

Advice? Help?

P.S.: I am using Python, Tweepy, regular expressions, and MongoDB/Apache S4 (for distributed computing).

Answer 1:

The first thing that comes to mind is to create a separate stream for each keyword and run each one in its own thread, like this:

from threading import Thread
import tweepy


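# Note: tweepy.StreamListener is the tweepy 3.x API (it was removed in tweepy 4).
class StreamListener(tweepy.StreamListener):
    def __init__(self, keyword, api=None):
        super(StreamListener, self).__init__(api)
        self.keyword = keyword  # the single keyword this stream tracks

    def on_status(self, tweet):
        # Not reached here: overriding on_data below bypasses the default dispatch.
        print('Ran on_status')

    def on_error(self, status_code):
        print('Error: ' + repr(status_code))
        return False  # returning False disconnects the stream

    def on_data(self, data):
        # Every raw message arrives tagged with this listener's keyword.
        print(self.keyword, data)
        print('Ok, this is actually running')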


def start_stream(auth, track):
    tweepy.Stream(auth=auth, listener=StreamListener(track)).filter(track=[track])


# Replace the placeholder strings with your own credentials.
auth = tweepy.OAuthHandler('<consumer_key>', '<consumer_secret>')
auth.set_access_token('<key>', '<secret>')

track = ['obama', 'cats', 'python']
for item in track:
    thread = Thread(target=start_stream, args=(auth, item))
    thread.start()

If you would rather distinguish tweets by keyword yourself in a single stream, read Twitter's documentation on how the track request parameter is matched. There are some edge cases that could cause problems.
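
A minimal sketch of doing that mapping yourself, assuming the matching rules described in Twitter's streaming documentation: commas in track act as logical ORs, spaces inside a phrase act as logical ANDs (in any order), and case is ignored. The names `tokenize` and `matched_keywords` and the `\w+` tokenizer are my own assumptions, not Twitter's actual implementation, and this only looks at the tweet text (Twitter also checks URLs, hashtags and mentions in the entities).

```python
import re

# Assumed tokenizer: runs of letters, digits and underscores.
# Twitter's real tokenization is not public; this is only an approximation.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return the set of lowercased word tokens found in text."""
    return set(TOKEN_RE.findall(text.lower()))


def matched_keywords(text, track):
    """Return the track phrases whose terms ALL appear in text."""
    tokens = tokenize(text)
    matches = []
    for phrase in track:
        terms = tokenize(phrase)
        # Skip empty phrases; otherwise require every term to be present.
        if terms and terms <= tokens:
            matches.append(phrase)
    return matches


track = ['obama', 'cats', 'Katrina Kaif']
print(matched_keywords('Kaif, Katrina spotted with #Cats', track))
```

Because the comparison is done on whole tokens, a keyword such as 'twitter' does not fire on 'TwitterTracker', which a plain substring search would report as a match. In a tweepy listener you could call `matched_keywords(status.text, track)` from `on_status`.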

Hope that helps.

Answer 2:

Return a list of any/all 'triggered' track terms

I had a closely related issue and solved it with a list comprehension. I had a list of raw tweets ('rawtweetlist') and a list of my track filter terms ('listoftermstofind'). You can then run the following to get, for each tweet, a list of all the track terms found in it (so the result is a list of lists).

j=[x.upper() for x in listoftermstofind] #your track filters, but making case insensitive
ListOfTweets=[x.upper() for x in rawtweetlist] #converting case to upper for all tweets
triggers=list(map(lambda y: list(filter(lambda x: x in y, j)), ListOfTweets))

This works well because the track filters in the API are specific (down to the character level) rather than relying on any natural-language search processing or anything like that. I recommend reading the API docs on filtering in detail; they go through the usage quite well: https://dev.twitter.com/streaming/overview/request-parameters
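
For illustration, here is the same idea run on made-up data. The sample tweets and terms are invented, and the nested list comprehension is meant to be equivalent to the map/filter version above. Note that Python's `in` does plain substring matching, so a short term such as 'cat' would also fire on 'category'.

```python
# Invented sample data, for illustration only.
listoftermstofind = ['obama', 'cats', 'Katrina Kaif']
rawtweetlist = [
    'Obama speaks tonight',
    'My CATS love Python',
    'katrina kaif in a new film',
    'Nothing relevant here',
]

# Upper-case both sides so the comparison is case-insensitive.
terms_upper = [term.upper() for term in listoftermstofind]
tweets_upper = [tweet.upper() for tweet in rawtweetlist]

# For each tweet, collect every term that occurs in it as a substring.
triggers = [[term for term in terms_upper if term in tweet]
            for tweet in tweets_upper]

print(triggers)
# → [['OBAMA'], ['CATS'], ['KATRINA KAIF'], []]
```

The position of each inner list lines up with the position of the tweet in `rawtweetlist`, so `zip(rawtweetlist, triggers)` pairs every tweet with the terms it triggered.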

Source: https://stackoverflow.com/questions/16602483/filtering-of-tweets-received-from-statuses-filter-streaming-api

Tags

python

twitter

tweepy

tweetstream