Retrieving sentence score based on values of words in a dictionary

Submitted by 删除回忆录丶 on 2019-12-01 06:33:49

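Update: Here's the easiest dplyr method I've found so far, with a stringi function added to speed things up. Provided there are no identical sentences in df$text, we can group by that column and then apply mutate(), so that each sentence is scored on its own. Note that stri_detect_fixed() only tests whether a word is present, so a word repeated within a sentence is counted once.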

Note: Package versions are dplyr 0.4.1 and stringi 0.4.1

library(dplyr)
library(stringi)

group_by(df, text) %>%
    mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))
# Source: local data frame [2 x 2]
# Groups: text
#
#             text score
# 1  I love pandas     2
# 2 I hate monkeys    -2

I removed the do() method I posted last night, but you can find it in the edit history. It seems unnecessary, since the method above works just as well and is the more idiomatic dplyr way to do it.

Additionally, if you're open to a non-dplyr answer, here are two using base functions.

total <- with(dict, {
    vapply(df$text, function(X) {
        sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
    }, 1)
})
cbind(df, total)
#             text total
# 1  I love pandas     2
# 2 I hate monkeys    -2

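Alternatively, strsplit() produces the same result. It matches whole space-separated words, and it needs df$text to be a character vector rather than a factor: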

s <- strsplit(df$text, " ")
total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
cbind(df, total)
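
The two base approaches agree on the example data, but they are not interchangeable: grepl() with fixed = TRUE matches substrings, while strsplit() plus match() only matches whole tokens. Here is a minimal sketch of the difference. The df and dict below are hypothetical stand-ins (the post does not show the real ones), with the word/score column layout inferred from the code above.

```r
# Hypothetical data, not from the original post.
df <- data.frame(text = c("I love pandas", "I loved pandas"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)

# Substring matching: "love" is also found inside "loved".
by_substring <- with(dict, vapply(df$text, function(X) {
    sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
}, 1, USE.NAMES = FALSE))

# Whole-token matching: "loved" is not an entry in dict$word.
by_token <- vapply(strsplit(df$text, " ", fixed = TRUE), function(x) {
    sum(with(dict, score[match(x, word, 0L)]))
}, 1)

by_substring
# [1] 2 2
by_token
# [1] 2 1
```

Which behaviour is preferable depends on whether inflected forms such as "loved" should inherit the score of "love".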

A bit of double looping via sapply and gregexpr:

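res <- sapply(dict$word, function(x) {
  # count how many times each dictionary word occurs in each sentence
  sapply(gregexpr(x, df$text), function(y) length(y[y != -1]))
})
# weight each word's column by its score; a plain res * dict$score would
# recycle the scores down the rows instead of across the word columns
drop(res %*% dict$score)
#[1]  2 -2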

This also accounts for multiple matches in a single string:

df <- data.frame(text = c("I love love pandas", "I hate monkeys"))
# run same code as above
#[1]  3 -2
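
For reference, here is a self-contained sketch of the counting approach. The four-word dictionary is an assumption (hypothetical, chosen because it reproduces the totals shown above), and the third sentence is added only to make the number of sentences differ from the number of words. The scores are applied with a matrix product so that each word's column is weighted by its own score regardless of how many sentences there are.

```r
# Hypothetical data, not from the original post.
df <- data.frame(text = c("I love love pandas",
                          "I hate monkeys",
                          "pandas hate monkeys"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1, -1, 1, -1),
                   stringsAsFactors = FALSE)

# counts[i, j] = number of times dict$word[j] occurs in df$text[i]
counts <- sapply(dict$word, function(w) {
  lengths(regmatches(df$text, gregexpr(w, df$text, fixed = TRUE)))
})

# One total per sentence: sum over words of count * score.
total <- drop(counts %*% dict$score)
total
# [1]  3 -2 -1
```

Passing fixed = TRUE treats each dictionary word as a literal string; drop it if the dictionary entries are meant to be regular expressions.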