Question
I want to do a sentiment analysis of German tweets. The code I use works fine with English, but when I load the German word list, all scores come out as zero. My guess is that this has to do with the different structures of the word lists. So what I need to know is how to adapt my code to the structure of the German word list. Could someone take a look at both lists?
English Wordlist
German Wordlist
# load the wordlists
pos.words = scan("~/positive-words.txt",what='character', comment.char=';')
neg.words = scan("~/negative-words.txt",what='character', comment.char=';')
# bring in the sentiment analysis algorithm
# we got a vector of sentences. plyr will handle a list or a vector as an "l"
# we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
    require(plyr)
    require(stringr)
    scores = laply(sentences, function(sentence, pos.words, neg.words)
    {
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)
        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)
        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)
        return(score)
    },
    pos.words, neg.words, .progress=.progress )
    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}
# and to see if it works, there should be a score...either in German or in English
sample = c("ich liebe dich. du bist wunderbar","I hate you. Die!");sample
test.sample = score.sentiment(sample, pos.words, neg.words);test.sample
Answer 1:
This may work for you:
readAndflattenSentiWS <- function(filename) {
    words = readLines(filename, encoding="UTF-8")
    words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
    words <- unlist(strsplit(words, ","))
    words <- tolower(words)
    return(words)
}
pos.words <- c(scan("positive-words.txt", what='character', comment.char=';', quiet=TRUE),
               readAndflattenSentiWS("SentiWS_v1.8c_Positive.txt"))
neg.words <- c(scan("negative-words.txt", what='character', comment.char=';', quiet=TRUE),
               readAndflattenSentiWS("SentiWS_v1.8c_Negative.txt"))
score.sentiment = function(sentences, pos.words, neg.words, .progress='none') {
# ... see OP ...
}
sample <- c("ich liebe dich. du bist wunderbar",
"Ich hasse dich, geh sterben!",
"i love you. you are wonderful.",
"i hate you, die.")
(test.sample <- score.sentiment(sample,
pos.words,
neg.words))
#   score                              text
# 1     2 ich liebe dich. du bist wunderbar
# 2    -2      ich hasse dich, geh sterben!
# 3     2    i love you. you are wonderful.
# 4    -2                  i hate you, die.
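To see what the substitution in readAndflattenSentiWS does, here is a minimal sketch that applies the same regular expression to two sample entries instead of a file. It assumes the fields are tab-separated, as the regex itself implies; the variable names lines, flat and words are only for illustration.

```r
# Two sample entries in the SentiWS layout; fields are assumed to be tab-separated
lines <- c("Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue",
           "Abdankung|NN\t-0.0048\tAbdankungen")

# Same substitution as in readAndflattenSentiWS: the "|TAG<tab>weight<tab>"
# part becomes a comma, so each line turns into a plain comma-separated list
flat <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", lines)
print(flat)

# Split at the commas and lowercase, because score.sentiment lowercases the tweets
words <- tolower(unlist(strsplit(flat, ",")))
print(words)
```

The lowercasing step matters here: German nouns are capitalised in the list, while the scoring function compares against lowercased tweet text.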
Answer 2:
The German lists are in files named SentiWS_v1.8c_Negative.txt and SentiWS_v1.8c_Positive.txt, which are not the files you are loading. The following only works for the English version:
pos.words = scan("~/positive-words.txt",what='character', comment.char=';')
neg.words = scan("~/negative-words.txt",what='character', comment.char=';')
Apart from that, the lists are in different formats.
The German version looks like this:
Abbau|NN -0.058 Abbaus,Abbaues,Abbauen,Abbaue
Abbruch|NN -0.0048 Abbruches,Abbrüche,Abbruchs,Abbrüchen
Abdankung|NN -0.0048 Abdankungen
Abdämpfung|NN -0.0048 Abdämpfungen
Abfall|NN -0.0048 Abfalles,Abfälle,Abfalls,Abfällen
Abfuhr|NN -0.3367 Abfuhren
The English version:
charismatic
charitable
charm
charming
charmingly
chaste
cheaper
cheapest
The German entries follow this pattern: word|NN\tnumber\t<comma-separated similar words>\n
The English entries follow this pattern: word\n
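A small sketch of why the scores come out as zero, assuming the German file is tab-separated as described above: scan() splits on whitespace, so the tokens it returns still carry the tag, the weight, or a whole comma-separated group, and none of them equals a lowercased tweet word. The names german_line and tokens are only for illustration.

```r
# One entry in the German layout, passed as text instead of a file
german_line <- "Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue"

# Same scan() arguments as in the question, plus quiet = TRUE
tokens <- scan(text = german_line, what = "character",
               comment.char = ";", quiet = TRUE)
print(tokens)

# score.sentiment lowercases each tweet and strips punctuation,
# so a tweet word would look like "abbau"; it is not among the tokens
print("abbau" %in% tokens)
```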
The header of each file is also different, so you may want to skip it (the English list starts with what looks like an article, not tweets or words from tweets).
Solution: either convert the two files to the same format and then proceed as before, or adapt your code to read both formats.
Your program already works for the English version, so I suggest changing the format of the German list. You could replace each space or comma with a \n and then remove all the |NN tags and the numbers.
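The conversion suggested above could be sketched roughly as follows. normalize_sentiws is a hypothetical helper, not part of any package, and it assumes that tags consist of uppercase letters and that weights are plain decimal numbers. It also lowercases the result, since the scoring function lowercases the tweets.

```r
# Hypothetical helper: turn SentiWS-style lines into a flat, lowercased word vector
normalize_sentiws <- function(lines) {
  # break every line into tokens at tabs, spaces, or commas
  tokens <- unlist(strsplit(lines, "[\t ,]+"))
  # drop a trailing part-of-speech tag such as "|NN" (assumed uppercase letters)
  tokens <- sub("\\|[A-Z]+$", "", tokens)
  # drop purely numeric tokens, i.e. the sentiment weights
  tokens <- tokens[!grepl("^-?[0-9.]+$", tokens)]
  # remove empty strings and lowercase to match the cleaned tweet text
  tolower(tokens[nzchar(tokens)])
}

# Two sample entries; fields are assumed to be tab-separated
lines <- c("Abbau|NN\t-0.058\tAbbaus,Abbaues,Abbauen,Abbaue",
           "Abfuhr|NN\t-0.3367\tAbfuhren")
german.neg <- normalize_sentiws(lines)
print(german.neg)
```

The resulting vector can be passed to score.sentiment directly, or written out with writeLines() to get a one-word-per-line file that the existing scan() call can read.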
Source: https://stackoverflow.com/questions/22116938/twitter-sentiment-analysis-w-r-using-german-language-set-sentiws