DocumentTermMatrix wrong counting when using a dictionary

Submitted by 大兔子大兔子 on 2019-12-07 18:26:36

Question


I am trying to do sentiment analysis on Twitter data using the naive Bayes algorithm.

I am working with 2,000 tweets.

After loading the data into RStudio, I split and preprocess the data as follows:

train_size = floor(0.75 * nrow(Tweets_Model_Input))
set.seed(123)
train_sub = sample(seq_len(nrow(Tweets_Model_Input)), size = train_size)

Tweets_Model_Input_Train = Tweets_Model_Input[train_sub, ]
Tweets_Model_Input_Test = Tweets_Model_Input[-train_sub, ]

myCorpus = Corpus(VectorSource(Tweets_Model_Input_Train$SentimentText))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) #removes common prepositions and conjunctions 
myCorpus <- tm_map(myCorpus, stripWhitespace)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
removeRetweet <- function(x) gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)
myCorpus <- tm_map(myCorpus, removeRetweet)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus.train <- tm_map(myCorpus, stemDocument, language = "english")  
myCorpus.train <- Corpus(VectorSource(myCorpus.train$content))


myCorpus = Corpus(VectorSource(Tweets_Model_Input_Test$SentimentText))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) #removes common prepositions and conjunctions 
myCorpus <- tm_map(myCorpus, stripWhitespace)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
removeRetweet <- function(x) gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)
myCorpus <- tm_map(myCorpus, removeRetweet)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus.test <- tm_map(myCorpus, stemDocument, language = "english") 
myCorpus.test <- Corpus(VectorSource(myCorpus.test$content))

So I get a training corpus and a test corpus for my NB algorithm. Next, I would like to create two DTMs restricted to the terms that appear at least 50 times in the training corpus. These terms are: "get" "miss" "day" "just" "now" "want" "good" "work"

fivefreq = findFreqTerms(dtm.train, lowfreq = 50, highfreq = Inf)
length((fivefreq))

dtm.train <- DocumentTermMatrix(myCorpus.train, control=list(dictionary = fivefreq))
dtm.test <- DocumentTermMatrix(myCorpus.test, control=list(dictionary = fivefreq))

For dtm.train this works well, but for dtm.test it does not work at all. The DTM has the columns for the terms selected above, but the counts in the matrix itself are not correct.

Tweet no. 1 of the training corpus is "omg celli happen yearswtf gota get bill paid". The subset of the DTM is correct:

Tweet no. 3 of the test corpus is "huge roll thunder just nowso scari". The subset of the DTM is not correct:

There is no "get" in that tweet, but there is a "just". So the count itself is right, but it ends up in the wrong column.
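The expected row for that test tweet can be checked by hand, independently of tm. The following is a minimal base R sketch, assuming the tweet text and the dictionary are exactly as quoted above; the variable names `dict`, `tweet`, `tokens`, and `expected` are made up for illustration and are not part of the original code.

```r
# Dictionary terms as quoted in the question
dict <- c("get", "miss", "day", "just", "now", "want", "good", "work")

# Test tweet no. 3 as quoted in the question (already preprocessed and stemmed)
tweet <- "huge roll thunder just nowso scari"

# Split on whitespace and count exact token matches for each dictionary term
tokens <- strsplit(tweet, "[[:space:]]+")[[1]]
expected <- vapply(dict, function(term) sum(tokens == term), integer(1))

# Only "just" should be 1; "nowso" is a different token from "now"
print(expected)
```

Comparing this named vector with the corresponding row of as.matrix(dtm.test) shows directly which column received the count.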

I have tried a lot to solve this problem, but I don't know what else to do. It looks to me as if tm builds the DTM from the terms of the specific corpus, and the dictionary is only used to rename the columns without having any other effect.

Thanks for your help!


Answer 1:


Edit: this is an actual bug. Using VCorpus() instead of Corpus() will also fix the problem.

This seems to be an actual bug. Try reverting tm to version 0.6-2. That fixed the problem for me.
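As a rough sketch of the VCorpus() workaround, the snippet below builds the test DTM from a VCorpus and prints it for comparison against the dictionary columns. It assumes the tm package is installed; `test_text` is a hypothetical stand-in containing only the one preprocessed tweet quoted in the question, and in the question's own pipeline the swap would presumably apply wherever Corpus(VectorSource(...)) is called.

```r
library(tm)

# Hypothetical stand-in for the preprocessed test tweets (only the one quoted above)
test_text <- c("huge roll thunder just nowso scari")

# Dictionary terms as quoted in the question
dict <- c("get", "miss", "day", "just", "now", "want", "good", "work")

# Use VCorpus() instead of Corpus(), as suggested in the answer
corpus.test <- VCorpus(VectorSource(test_text))
dtm.test <- DocumentTermMatrix(corpus.test, control = list(dictionary = dict))

# Inspect the counts; each column is named after a dictionary term
m <- as.matrix(dtm.test)
print(m)
```

If the workaround behaves as the answer describes, the count for this tweet should sit in the "just" column rather than under "get".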



Source: https://stackoverflow.com/questions/43322672/documenttermmatrix-wrong-counting-when-using-a-dictionary
