What's a good set of heuristics for threading tweets?

Posted by 我只是一个虾纸丫 on 2019-11-30 06:49:24

Since there's only been one answer, and the bounty deadline is approaching soon, I thought I should add a baseline answer so the bounty isn't automatically awarded to an answer that doesn't add much beyond what's in the question.

The obvious first step is to take your original set of tweets and follow all in_reply_to_status_id links to build many directed acyclic graphs. These relationships you can be nearly 100% sure about. (You should follow the links even through tweets that aren't in the original set, adding those to the set of status updates that you're considering.)
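As a rough illustration of this first step, here is a minimal Python sketch. It assumes each tweet is a dict with an optional "in_reply_to_status_id" key, and that `fetch_tweet` is a hypothetical callable standing in for whatever API lookup retrieves a status that isn't in the original set. Since each update names at most one parent, every graph built this way is a tree.

```python
from collections import defaultdict


def build_reply_forest(tweets, fetch_tweet=None):
    """Group tweets into reply trees by following in_reply_to_status_id.

    tweets:      dict mapping status id -> tweet dict (assumed shape).
    fetch_tweet: optional callable (hypothetical) returning a tweet dict,
                 or None, for a status id outside the original set.
    Returns (children, roots): children maps a status id to the ids that
    reply to it; roots lists the ids with no known parent.
    """
    known = dict(tweets)
    pending = list(known)
    # Pull in parents that are missing from the original set.
    while pending:
        status_id = pending.pop()
        parent_id = known[status_id].get("in_reply_to_status_id")
        if parent_id is None or parent_id in known:
            continue
        parent = fetch_tweet(parent_id) if fetch_tweet else None
        if parent is not None:
            known[parent_id] = parent
            pending.append(parent_id)

    children = defaultdict(list)
    roots = []
    for status_id, tweet in known.items():
        parent_id = tweet.get("in_reply_to_status_id")
        if parent_id in known:
            children[parent_id].append(status_id)
        else:
            # No parent, or the parent could not be retrieved.
            roots.append(status_id)
    return dict(children), sorted(roots)
```

Each entry in `roots` starts one conversation; a tweet whose parent could not be retrieved simply becomes the root of a partial thread.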

Beyond that easy step, one has to deal with the "mentions". Unlike in email threading, there's nothing helpful like a subject line that one can match on, so this is inevitably going to be very error-prone. The approach I would take is to create a feature vector for every possible relationship between status IDs that might be represented by the mentions in a tweet, and then train a classifier to guess the best option, including a "no reply" option.

To work out the "every possible relationship" bit, start by considering every status update that mentions one or more other users and doesn't contain an in_reply_to_status_id. Suppose an example of one of these tweets is [1]:

@a @b no it isn't lol  RT @c Yes, absolutely. /cc @stephenfry

... you would create a feature vector for the relationship between this update and every earlier update in the timelines of @a, @b, @c, and @stephenfry from the last week (say), plus one between this update and a special "no reply" update. Then you have to decide what goes into each feature vector - you can add whatever you would like, but I would at least suggest:

  • The time that elapsed between the two updates - presumably replies are more likely to be to recent updates.
  • How far through the tweet the mention occurs, as a proportion of its words. e.g. if the mention is the first word, the score would be 0, and that's probably more likely to indicate a reply than a mention later in the update.
  • The number of followers of the mentioned user - celebrities are presumably more likely to be spam-mentioned.
  • The length of the longest common substring between the updates, which might indicate direct quoting.
  • Is the mention preceded by "/cc" or other signifiers that indicate that this isn't directly a reply to that person?
  • The following / followed ratio for the author of the original update.
  • etc.
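To make the candidate generation and a few of these features concrete, here is a minimal Python sketch. The tweet shape (dicts with "text", "user" and "timestamp" in seconds), the `timelines` mapping, the one-week window and the `CC_MARKERS` set are all assumptions made for illustration. The follower count is passed in as a parameter because it would come from profile data, and the following / followed ratio is left out for the same reason.

```python
import re
from difflib import SequenceMatcher

MENTION = re.compile(r"@(\w+)")
CC_MARKERS = {"/cc", "cc"}  # assumed signifiers of "not a direct reply"


def candidate_updates(reply, timelines, window_seconds=7 * 24 * 3600):
    """Yield each earlier update (within the window) by a mentioned user,
    followed by None, which stands for the special "no reply" option."""
    users = dict.fromkeys(u.lower() for u in MENTION.findall(reply["text"]))
    for user in users:
        for update in timelines.get(user, []):
            age = reply["timestamp"] - update["timestamp"]
            if 0 < age <= window_seconds:
                yield update
    yield None


def mention_features(reply, candidate, followers_of_mentioned):
    """Feature dict for the pair (reply, candidate)."""
    words = reply["text"].split()
    target = "@" + candidate["user"].lower()
    # Index of the first word that mentions the candidate's author.
    position = next(
        (i for i, w in enumerate(words)
         if w.lower().rstrip(".,:;!?") == target),
        None,
    )
    if position is None:
        raise ValueError("reply does not mention the candidate's author")

    a, b = reply["text"], candidate["text"]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    longest = matcher.find_longest_match(0, len(a), 0, len(b))

    return {
        "seconds_elapsed": reply["timestamp"] - candidate["timestamp"],
        "mention_position": position / len(words),  # 0 means first word
        "mentioned_followers": followers_of_mentioned,
        "longest_common_substring": longest.size,
        "preceded_by_cc": position > 0
        and words[position - 1].lower() in CC_MARKERS,
    }
```

For the example tweet above, the mention of @stephenfry would come out with `preceded_by_cc` set and a `mention_position` close to 1, while a candidate from @c reading "Yes, absolutely." would score highly on the common-substring feature.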

The more of these one can come up with the better, since the classifier should end up relying mainly on those that turn out to be useful. I'd suggest trying a random forest classifier, which is conveniently implemented in Weka.

Next one needs a training set. This can be small at first - just enough to get a service that identifies conversations up and running. To this basic service, one would have to add a nice interface that lets users correct mismatched or falsely linked updates. Using this data one can build a bigger training set and a more accurate classifier.

[1] ... which might be typical of the level of discourse on Twitter ;)

On Twitter, people often write "RT" in front of the message they are replying to.
