Actually, I am trying to do sentiment analysis on Twitter data using the Naive Bayes algorithm. I am looking at 2000 tweets.
After getting the data into RStudio, I split and preprocess the data as follows:
library(tm)

# reproducible 75/25 train/test split
train_size <- floor(0.75 * nrow(Tweets_Model_Input))
set.seed(123)
train_sub <- sample(seq_len(nrow(Tweets_Model_Input)), size = train_size)
Tweets_Model_Input_Train <- Tweets_Model_Input[train_sub, ]
Tweets_Model_Input_Test <- Tweets_Model_Input[-train_sub, ]
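As a quick sanity check (assuming the input really holds the 2000 tweets mentioned above), the split sizes can be verified:

nrow(Tweets_Model_Input_Train) # expected 1500
nrow(Tweets_Model_Input_Test)  # expected 500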
# build and clean the training corpus
myCorpus <- Corpus(VectorSource(Tweets_Model_Input_Train$SentimentText))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) # removes common prepositions and conjunctions
myCorpus <- tm_map(myCorpus, stripWhitespace)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeRetweet <- function(x) gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeRetweet))
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus.train <- tm_map(myCorpus, stemDocument, language = "english")
myCorpus.train <- Corpus(VectorSource(myCorpus.train$content))
# build and clean the test corpus with the same steps
# (removeURL and removeRetweet are already defined above)
myCorpus <- Corpus(VectorSource(Tweets_Model_Input_Test$SentimentText))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
myCorpus <- tm_map(myCorpus, content_transformer(removeRetweet))
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus.test <- tm_map(myCorpus, stemDocument, language = "english")
myCorpus.test <- Corpus(VectorSource(myCorpus.test$content))
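Since the two pipelines are identical, they could also be wrapped in a small helper (clean_corpus is a hypothetical name, sketched here from the same tm calls as above):

clean_corpus <- function(texts) {
  corp <- Corpus(VectorSource(texts))
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, content_transformer(removeURL))
  corp <- tm_map(corp, content_transformer(removeRetweet))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, content_transformer(tolower))
  tm_map(corp, stemDocument, language = "english")
}
# myCorpus.train <- clean_corpus(Tweets_Model_Input_Train$SentimentText)
# myCorpus.test <- clean_corpus(Tweets_Model_Input_Test$SentimentText)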
So I get a train and a test corpus for my NB algorithm. After that, I would like to create two DTMs based on the terms that appear at least 50 times in the train corpus. These terms are: "get", "miss", "day", "just", "now", "want", "good", "work".
dtm.train.full <- DocumentTermMatrix(myCorpus.train) # unrestricted DTM, only used to find the frequent terms
fivefreq <- findFreqTerms(dtm.train.full, lowfreq = 50, highfreq = Inf)
length(fivefreq)
dtm.train <- DocumentTermMatrix(myCorpus.train, control = list(dictionary = fivefreq))
dtm.test <- DocumentTermMatrix(myCorpus.test, control = list(dictionary = fivefreq))
For dtm.train this works fine, but for dtm.test it does not work at all. The test DTM does use the terms selected above as columns, but the counts in the matrix itself are not correct.
Tweet no. 1 of the training corpus is "omg celli happen yearswtf gota get bill paid", and its row in the DTM is correct.
Tweet no. 3 of the test corpus is "huge roll thunder just nowso scari", but its row in the DTM is not correct: there is no "get" in that tweet, yet a count appears under it; there is, however, a "just". So the counting itself is somehow right, but it ends up in the wrong column.
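For reference, the rows can be checked with inspect() (the indices refer to the tweets quoted above):

inspect(dtm.train[1, ]) # training tweet no. 1: counts match the text
inspect(dtm.test[3, ])  # test tweet no. 3: counts land under the wrong terms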
I have tried a lot to solve this problem, but I am out of ideas. To me it looks like tm creates the DTM from the terms of the specific corpus, and the dictionary is only used to relabel the columns without actually restricting the counting to those terms.
Thanks for your help!
Edit: this is an actual bug. Using VCorpus() instead of Corpus() also fixes the problem.
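A minimal sketch of that workaround, replacing the Corpus() calls above with VCorpus() (the cleaning steps stay the same):

myCorpus.train <- VCorpus(VectorSource(Tweets_Model_Input_Train$SentimentText))
myCorpus.test <- VCorpus(VectorSource(Tweets_Model_Input_Test$SentimentText))
# ... same cleaning steps as above ...
dtm.test <- DocumentTermMatrix(myCorpus.test, control = list(dictionary = fivefreq))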
Answer: This seems to be an actual bug. Try reverting to version 0.6-2 of tm; that fixed the problem for me.
Source: https://stackoverflow.com/questions/43322672/documenttermmatrix-wrong-counting-when-using-a-dictionary