I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:
library(qdap)
polarity(DATA$state)$all$polarity
# Results:
[1] -0.8165 -0.4082 0.0000 -0.8944 0.0000 0.0000 0.0000 -0.5774 0.0000
[10] 0.4082 0.0000
Warning message:
In polarity(DATA$state) :
Some rows contain double punctuation. Suggested use of `sentSplit` function.
This warning can't be ignored, because polarity appears to add up the scores of each sentence in a document. This can result in document-level polarity scores outside the [-1, 1] bounds.
I'm aware of the option to first run sentSplit and then average across the sentences, perhaps weighting polarity by word count, but (1) this is inefficient (it takes roughly 4x as long as running on the full documents with the warning), and (2) it is unclear how to weight the sentences. This option would look something like this:
DATA$id <- seq(nrow(DATA)) # For identifying and aggregating documents
sentences <- sentSplit(DATA, "state")
library(data.table) # For aggregation
pol.dt <- data.table(polarity(sentences$state)$all)
pol.dt[, id := sentences$id]
document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), "id"]
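On the weighting question, one possibility is to undo the per-sentence scaling rather than average. This is only a sketch in base R on made-up numbers: it assumes each sentence score is the sentence's raw score divided by sqrt(wc), and it ignores any context that polarity would pick up across sentence boundaries. The data frame and its values are hypothetical, not qdap output.

```r
# Toy sentence-level results (hypothetical values, not produced by qdap).
sentences <- data.frame(
  id       = c(1, 1, 2),      # document each sentence belongs to
  polarity = c(-1, 0, 0.25),  # sentence-level polarity
  wc       = c(4, 5, 16)      # sentence word count
)

# Recover each sentence's raw score, ASSUMING polarity = raw / sqrt(wc).
sentences$raw <- sentences$polarity * sqrt(sentences$wc)

# Sum raw scores and word counts per document, then rescale by sqrt(total wc).
raw.sum    <- tapply(sentences$raw, sentences$id, sum)
wc.sum     <- tapply(sentences$wc, sentences$id, sum)
recombined <- raw.sum / sqrt(wc.sum)

# Word-count-weighted mean of sentence polarity, for comparison.
weighted <- tapply(sentences$polarity * sentences$wc, sentences$id, sum) / wc.sum

print(recombined)
print(weighted)
```

The two schemes agree for a single-sentence document but diverge once a document has several sentences, so the choice of weighting does change the result.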
I was hoping I could run polarity on a version of the vector with periods removed, but it seems that sentSplit does more than that. Removing periods works on DATA but not on other sets of text (I'm unsure of the full set of breaks other than periods).
So, I suspect the best way of approaching this is to make each element of the document vector look like one long sentence. How would I do this, or is there another way?
Max found a bug in this version of qdap (1.3.4) that counted a placeholder as a word, which affects the equation because the denominator is sqrt(n), where n is the word count. As of 1.3.5 this has been corrected, which is why the two different outputs did not match.
Here is the output:
library(qdap)
counts(polarity(DATA$state))[, "polarity"]
## > counts(polarity(DATA$state))[, "polarity"]
## [1] -0.8164966 -0.4472136 0.0000000 -1.0000000 0.0000000 0.0000000 0.0000000
## [8] -0.5773503 0.0000000 0.4082483 0.0000000
## Warning message:
## In polarity(DATA$state) :
## Some rows contain double punctuation. Suggested use of `sentSplit` function.
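The changed values are consistent with the sqrt(n) denominator. As a rough check in base R, assume the second and fourth documents have raw scores of -1 and -2 and word counts of 5 and 4. Those inputs are my reading of the printed outputs, not values extracted from qdap. Counting one extra placeholder as a word then gives the older scores:

```r
# ASSUMED inputs inferred from the printed outputs, not taken from qdap.
raw.score  <- c(-1, -2)  # summed sentiment for documents 2 and 4
word.count <- c(5, 4)    # word counts without the placeholder

# Polarity with the correct word count (1.3.5 behaviour).
corrected <- raw.score / sqrt(word.count)

# Polarity when one placeholder is wrongly counted as a word (1.3.4 behaviour).
with.placeholder <- raw.score / sqrt(word.count + 1)

print(round(corrected, 4))         # -0.4472 -1.0000
print(round(with.placeholder, 4))  # -0.4082 -0.8944
```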
In this case using strip does not matter. It may in more complex situations involving amplifiers, negators, negatives, and commas. Here is an example:
## > counts(polarity("Really, I hate it"))[, "polarity"]
## [1] -0.5
## > counts(polarity(strip("Really, I hate it")))[, "polarity"]
## [1] -0.9
See the documentation for more.
Looks like removing punctuation and other extras tricks polarity into treating each element of the vector as a single sentence:
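library(tm) # provides removePunctuation, removeNumbers, stripWhitespace
# Lowercase the text, collapse whitespace, then drop numbers and punctuation
SimplifyText <- function(x) {
  removePunctuation(removeNumbers(stripWhitespace(tolower(x))))
}
polarity(SimplifyText(DATA$state))$all$polarity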
# Result (no warning)
[1] -0.8165 -0.4472 0.0000 -1.0000 0.0000 0.0000 0.0000 -0.5774 0.0000
[10] 0.4082 0.0000
Source: https://stackoverflow.com/questions/22774913/estimating-document-polarity-using-rs-qdap-package-without-sentsplit