Estimating document polarity using R's qdap package without sentSplit

Submitted by 社会主义新天地 on 2019-12-05 12:33:40

Max found a bug in qdap 1.3.4: a placeholder was counted as a word, which affected the score because the denominator of the equation is sqrt(n), where n is the word count. This was corrected in 1.3.5, which is why the two outputs did not match.
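To see why one extra counted word changes the result, here is a minimal base-R sketch of the normalisation step. It assumes the score is a net word weight divided by sqrt(n). The helper polarity_score and the inputs (net weight -2, six words) are hypothetical, chosen only because they reproduce the first score in the output shown in this answer.

```r
# Hypothetical helper, not part of qdap: net weight over sqrt(word count).
polarity_score <- function(net_weight, n_words) {
  net_weight / sqrt(n_words)
}

correct  <- polarity_score(-2, 6)  # six real words
inflated <- polarity_score(-2, 7)  # same text with one placeholder counted as a word

round(correct, 7)  # -0.8164966
inflated           # closer to zero, because the denominator is larger
```

An off-by-one in n never changes the sign of a score, only its magnitude, so the effect is easy to miss unless two versions are compared side by side.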

Here is the output:

library(qdap)
counts(polarity(DATA$state))[, "polarity"]

## > counts(polarity(DATA$state))[, "polarity"]
##  [1] -0.8164966 -0.4472136  0.0000000 -1.0000000  0.0000000  0.0000000  0.0000000
##  [8] -0.5773503  0.0000000  0.4082483  0.0000000
## Warning message:
## In polarity(DATA$state) : 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.

In this case, using strip makes no difference. It can matter in more complex situations involving amplifiers, negators, negatives, and commas. Here is an example:

## > counts(polarity("Really, I hate it"))[, "polarity"]
## [1] -0.5
## > counts(polarity(strip("Really, I hate it")))[, "polarity"]
## [1] -0.9

See the documentation for more.
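The two scores in the example are consistent with simple arithmetic, sketched here in base R. This assumes "hate" carries a weight of -1, that "Really" acts as an amplifier with weight 0.8 (believed to be the default of polarity's amplifier.weight argument), and that the comma stops the amplifier from reaching "hate". The helper amplified_score is hypothetical, not a qdap function.

```r
# Hypothetical helper: polarized weight scaled by amplifiers, over sqrt(word count).
amplified_score <- function(word_weight, n_amplifiers, n_words,
                            amplifier_weight = 0.8) {
  word_weight * (1 + n_amplifiers * amplifier_weight) / sqrt(n_words)
}

# With the comma kept, "Really" does not reach "hate".
with_comma <- amplified_score(-1, n_amplifiers = 0, n_words = 4)

# After strip removes the comma, "really" amplifies "hate".
stripped <- amplified_score(-1, n_amplifiers = 1, n_words = 4)

with_comma  # -0.5
stripped    # -0.9
```

Under these assumptions, removing the comma alone accounts for the shift from -0.5 to -0.9, with the word count unchanged at four.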

Looks like removing punctuation and other extras tricks polarity into thinking the vector is a single sentence:

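library(tm)  # provides removePunctuation, removeNumbers, stripWhitespace
# Lower-case the text, collapse whitespace, then drop digits and punctuation
SimplifyText <- function(x) {
  removePunctuation(removeNumbers(stripWhitespace(tolower(x))))
}
polarity(SimplifyText(DATA$state))$all$polarity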
# Result (no warning)
 [1] -0.8165 -0.4472  0.0000 -1.0000  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000 