searchTwitter() in twitteR package for R (2.15.2) - high number of duplicate tweets

Submitted by 柔情痞子 on 2020-01-25 07:40:06

Question


I'm trying to create a data frame of Twitter usernames associated with a keyword by pulling from the Twitter REST API. But queries using searchTwitter() on many search terms (e.g. #rstats), even for large samples such as n = 1000, return a high proportion (>90%) of duplicate tweets.

A specific example would be:

tweets <- searchTwitter("#rstats", n = 1000)
tweets.df <- do.call("rbind", lapply(tweets, as.data.frame))

df.undup <- df[duplicated(tweets.df) == FALSE,]
dim(df.undup)

I'm wondering if this is caused by pagination limits when the search term is relatively scarce?


Answer 1:


First of all, should the 3rd line in your code be df.undup <- tweets.df[duplicated(tweets.df) == FALSE,] ?

I guess you're getting fewer than 1000 tweets when you run the above code (I got 604, and the result of dim(df.undup) is 604 10). So the problem, I suspect, is not that duplicates are present, but that fewer than 1000 tweets are returned in the first place.

If you look at the created date, the oldest tweets are from 14th March (a week ago). The Twitter API usually doesn't allow access to tweets more than 7-9 days old. I guess that's why you're getting fewer tweets.
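To see that window concretely, the created timestamps that as.data.frame() attaches to each tweet can be inspected. A minimal self-contained sketch, with fixed toy timestamps standing in for the real tweets.df$created column (the real values require API access):

```r
# Toy stand-in for tweets.df$created (POSIXct timestamps); fixed dates
# are used here for illustration, since real values come from the API.
now     <- as.POSIXct("2013-03-21 12:00:00", tz = "UTC")
created <- now - c(1, 3, 6, 9) * 86400   # tweets 1, 3, 6, and 9 days old

range(created)                   # oldest and newest tweet in the sample
sum(created >= now - 7 * 86400)  # how many fall inside a 7-day window: 3
```

Running the same range() call on a real tweets.df$created column shows immediately whether the oldest returned tweet sits at the edge of the search window.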

To check, see whether dim(tweets.df) and dim(df.undup) return the same thing.
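The de-duplication step itself can be verified without hitting the API at all. A minimal sketch with a toy data frame standing in for tweets.df:

```r
# Self-contained sketch of the de-duplication step: a toy data frame
# stands in for the tweets.df built in the question.
tweets.df <- data.frame(
  text       = c("tweet A", "tweet B", "tweet A", "tweet C", "tweet B"),
  screenName = c("user1",   "user2",   "user1",   "user3",   "user2"),
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each fully-duplicated row.
df.undup <- tweets.df[!duplicated(tweets.df), ]

dim(tweets.df)  # 5 2 -- raw rows
dim(df.undup)   # 3 2 -- after dropping duplicate rows
```

As a side note, twitteR also provides twListToDF(tweets) as a one-call replacement for the do.call("rbind", lapply(tweets, as.data.frame)) idiom in the question.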



Source: https://stackoverflow.com/questions/15548316/searchtwitter-in-twitter-package-for-r-2-15-2-high-number-of-duplicate-twe
