Edited df
and dict
I have a data frame containing sentences:
df <- data_frame(text = c("I love pandas", "I hate monkeys", "pandas pandas pandas", "monkeys monkeys"))
And a dictionary containing words and their corresponding scores:
dict <- data_frame(word = c("love", "hate", "pandas", "monkeys"),
score = c(1,-1,1,-1))
I want to append a column "score" to df
that would sum the score for each sentence:
Expected results
text score
1 I love pandas 2
2 I hate monkeys -2
3 pandas pandas pandas 3
4 monkeys monkeys -2
Update
Here are the results so far:
Akrun's methods
Suggestion 1
df %>% mutate(score = sapply(strsplit(text, ' '), function(x) with(dict, sum(score[word %in% x]))))
Note that for this method to work, I had to use data_frame()
to create df
and dict
instead of data.frame()
otherwise I get: Error in strsplit(text, " ") : non-character argument
Source: local data frame [4 x 2]
text score
1 I love pandas 2
2 I hate monkeys -2
3 pandas pandas pandas 1
4 monkeys monkeys -1
This does not accounts for multiple matches in a single string. Close to expected result, but not quite there yet.
Suggestion 2
I tweaked a bit one of akrun's suggestion in the comments to apply it to the edited post
cbind(df, unnest(stri_split_fixed(df$text, ' '), group) %>%
group_by(group) %>%
summarise(score = sum(dict$score[dict$word %in% x])) %>%
ungroup() %>% select(-group) %>% data.frame())
This does not account for multiple matches in a string:
text score
1 I love pandas 2
2 I hate monkeys -2
3 pandas pandas pandas 1
4 monkeys monkeys -1
Richard Scriven's methods
Suggestion 1
group_by(df, text) %>%
mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))
After updating all packages, this now works (although it does not account for multiple matches)
Source: local data frame [4 x 2]
Groups: text
text score
1 I love pandas 2
2 I hate monkeys -2
3 pandas pandas pandas 1
4 monkeys monkeys -1
Suggestion 2
total <- with(dict, {
vapply(df$text, function(X) {
sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
}, 1)
})
cbind(df, total)
This give the same results:
text total
1 I love pandas 2
2 I hate monkeys -2
3 pandas pandas pandas 1
4 monkeys monkeys -1
Suggestion 3
s <- strsplit(df$text, " ")
total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
cbind(df, total)
This actually works:
text total
1 I love pandas 2
2 I hate monkeys -2
3 pandas pandas pandas 3
4 monkeys monkeys -2
Thelatemail's method
res <- sapply(dict$word, function(x) {
sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) )
})
cbind(df, score = rowSums(res * dict$score))
Note that I added the cbind()
part. This actually match the expected result.
text score
1 I love pandas 2
2 I hate monkeys -2
3 pandas pandas pandas 3
4 monkeys monkeys -2
Final answer
Inspired by akrun's suggestion, here is what I ended up writing as the most dplyr
-esque solution:
library(dplyr)
library(tidyr)
library(stringi)
bind_cols(df, unnest(stri_split_fixed(df$text, ' '), group) %>%
group_by(x) %>% mutate(score = sum(dict$score[dict$word %in% x])) %>%
group_by(group) %>%
summarise(score = sum(score)) %>%
select(-group))
Although I will implement Richard Scriven's suggestion #3 since it's the most efficient.
Benchmark
Here are the suggestions applied to much larger datasets (df
of 93 sentences and dict
of 14K words) using microbenchmark()
:
mbm = microbenchmark(
akrun = df %>% mutate(score = sapply(stri_detect_fixed(text, ' '), function(x) with(dict, sum(score[word %in% x])))),
akrun2 = cbind(df, unnest(stri_split_fixed(df$text, ' '), group) %>% group_by(group) %>% summarise(score = sum(dict$score[dict$word %in% x])) %>% ungroup() %>% select(-group) %>% data.frame()),
rscriven1 = group_by(df, text) %>% mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)])),
rscriven2 = cbind(df, score = with(dict, { vapply(df$text, function(X) { sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])}, 1)})),
rscriven3 = cbind(df, score = vapply(strsplit(df$text, " "), function(x) sum(with(dict, score[match(x, word, 0L)])), 1)),
thelatemail = cbind(df, score = rowSums(sapply(dict$word, function(x) { sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) ) }) * dict$score)),
sbeaupre = bind_cols(df, unnest(stri_split_fixed(df$text, ' '), group) %>% group_by(x) %>% mutate(score = sum(dict$score[dict$word %in% x])) %>% group_by(group) %>% summarise(score = sum(score)) %>% select(-group)),
times = 10
)
And the results:
Update : Here's the easiest dplyr
method I've found so far. And I'll add a stringi
function to speed things up. Provided there are no identical sentences in df$text
, we can group by that column and then apply mutate()
Note: Package versions are dplyr 0.4.1 and stringi 0.4.1
library(dplyr)
library(stringi)
group_by(df, text) %>%
mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))
# Source: local data frame [2 x 2]
# Groups: text
#
# text score
# 1 I love pandas 2
# 2 I hate monkeys -2
I removed the do()
method I posted last night, but you can find it in the edit history. To me it seems unnecessary since the above method works as well and is the more dplyr
way to do it.
Additionally, if you're open to a non-dplyr
answer, here are two using base functions.
total <- with(dict, {
vapply(df$text, function(X) {
sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
}, 1)
})
cbind(df, total)
# text total
# 1 I love pandas 2
# 2 I hate monkeys -2
Or an alternative using strsplit()
produces the same result
s <- strsplit(df$text, " ")
total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
cbind(df, total)
A bit of double looping via sapply
and gregexpr
:
res <- sapply(dict$word, function(x) {
sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) )
})
rowSums(res * dict$score)
#[1] 2 -2
This also accounts for when there is multiple matches in a single string:
df <- data.frame(text = c("I love love pandas", "I hate monkeys"))
# run same code as above
#[1] 3 -2
来源:https://stackoverflow.com/questions/28059939/retrieving-sentence-score-based-on-values-of-words-in-a-dictionary