Retrieving sentence score based on values of words in a dictionary

陌清茗 2021-01-13 08:34

Edited df and dict

I have a data frame containing sentences:

df <- data_frame(text = c("I love pandas


        
2 Answers
  • 2021-01-13 09:11

    Update: Here's the easiest dplyr method I've found so far, with a stringi function added to speed things up. Provided there are no identical sentences in df$text, we can group by that column and then apply mutate(), so that each group holds exactly one sentence and sum() is computed per sentence.

    Note: Package versions are dplyr 0.4.1 and stringi 0.4.1

    library(dplyr)
    library(stringi)
    
    group_by(df, text) %>%
        mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))
    # Source: local data frame [2 x 2]
    # Groups: text
    #
    #             text score
    # 1  I love pandas     2
    # 2 I hate monkeys    -2
    

    I removed the do() method I posted last night, but you can find it in the edit history. It seems unnecessary to me, since the method above works just as well and is more idiomatic dplyr.

    Additionally, if you're open to a non-dplyr answer, here are two using base functions.

    total <- with(dict, {
        vapply(df$text, function(X) {
            sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
        }, 1)
    })
    cbind(df, total)
    #             text total
    # 1  I love pandas     2
    # 2 I hate monkeys    -2
    

    Alternatively, strsplit() produces the same result by splitting each sentence on spaces and looking the resulting words up in dict$word:

    s <- strsplit(df$text, " ")
    total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
    cbind(df, total)
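
The two base approaches above agree on the example data, but they match differently: grepl() with fixed = TRUE finds substrings, while strsplit() plus match() only finds whole space-separated words. The sketch below illustrates the difference. The `dict` used here is an assumption (the question's own dictionary is not shown), and the "panda" entry is added purely for illustration.

```r
# Hypothetical inputs: `dict` is assumed, not taken from the question.
df <- data.frame(text = c("I love pandas", "I hate monkeys"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word  = c("love", "hate", "panda"),
                   score = c(1, -1, 1),
                   stringsAsFactors = FALSE)

# Substring matching: a word counts if it appears anywhere in the sentence.
substring_total <- vapply(df$text, function(sentence) {
  hit <- vapply(dict$word, grepl, logical(1L), x = sentence, fixed = TRUE)
  sum(dict$score[hit])
}, numeric(1L), USE.NAMES = FALSE)

# Whole-word matching: a word counts only if it equals a whole token.
tokens <- strsplit(df$text, " ", fixed = TRUE)
token_total <- vapply(tokens, function(x) {
  sum(dict$score[match(x, dict$word, nomatch = 0L)])
}, numeric(1L))

data.frame(text = df$text, substring_total, token_total)
```

Here "panda" scores for "I love pandas" only under substring matching, because the token is "pandas". Also note that the token version counts a repeated word once per occurrence, whereas the grepl() version counts each dictionary word at most once per sentence.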
    
  • 2021-01-13 09:22

    A bit of double looping via sapply and gregexpr:

    res <- sapply(dict$word, function(x) {
      sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) )
    })
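    # matrix product: weights each word's column of counts by that word's score
    # (res * dict$score would recycle the scores down the columns instead)
    drop(res %*% dict$score)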
    #[1]  2 -2
    

    This also accounts for cases where there are multiple matches in a single string:

    df <- data.frame(text = c("I love love pandas", "I hate monkeys"))
    # run same code as above
    #[1]  3 -2
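
For reference, here is a self-contained sketch of the same counting idea. It assumes a hypothetical `dict` (the question's own dictionary is not shown), uses fixed = TRUE so that dictionary words are treated literally instead of as regular expressions, and combines counts and scores with a matrix product.

```r
# Hypothetical inputs: `dict` is assumed, not taken from the question.
df <- data.frame(text = c("I love love pandas", "I hate monkeys"),
                 stringsAsFactors = FALSE)
dict <- data.frame(word  = c("love", "pandas", "hate", "monkeys"),
                   score = c(1, 1, -1, -1),
                   stringsAsFactors = FALSE)

# counts[i, j] = number of times dict$word[j] occurs in df$text[i].
counts <- vapply(dict$word, function(w) {
  m <- gregexpr(w, df$text, fixed = TRUE)
  # gregexpr() returns -1 when there is no match, so count the other positions.
  vapply(m, function(pos) sum(pos != -1L), integer(1L))
}, integer(nrow(df)))

# Matrix product: each column of counts is weighted by its own word's score.
total <- drop(counts %*% dict$score)
total
```

The matrix product is used because an element-wise `counts * dict$score` recycles the scores down the columns of the matrix, which lines up with the words only for particular orderings of `dict`.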
    