Retrieving sentence score based on values of words in a dictionary

后端未结

关注

 2  1707

Edited df and dict

I have a data frame containing sentences:

df <- data_frame(text = c(\"I love pandas


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  清歌不尽        
                
              
                            
                2021-01-13 09:11
              
            
            
                                                                       
Update : Here's the easiest dplyr method I've found so far. And I'll add a stringi function to speed things up. Provided there are no identical sentences in df$text, we can group by that column and then apply mutate()

Note: Package versions are dplyr 0.4.1 and stringi 0.4.1

library(dplyr)
library(stringi)

group_by(df, text) %>%
    mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))
# Source: local data frame [2 x 2]
# Groups: text
#
#             text score
# 1  I love pandas     2
# 2 I hate monkeys    -2


I removed the do() method I posted last night, but you can find it in the edit history.  To me it seems unnecessary since the above method works as well and is the more dplyr way to do it.

Additionally, if you're open to a non-dplyr answer, here are two using base functions.

total <- with(dict, {
    vapply(df$text, function(X) {
        sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
    }, 1)
})
cbind(df, total)
#             text total
# 1  I love pandas     2
# 2 I hate monkeys    -2


Or an alternative using strsplit() produces the same result

s <- strsplit(df$text, " ")
total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
cbind(df, total)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  没有蜡笔的小新        
                
              
                            
                2021-01-13 09:22
              
            
            
                                                                       
A bit of double looping via sapply and gregexpr:

res <- sapply(dict$word, function(x) {
  sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) )
})
rowSums(res * dict$score)
#[1]  2 -2


This also accounts for when there is multiple matches in a single string:

df <- data.frame(text = c("I love love pandas", "I hate monkeys"))
# run same code as above
#[1]  3 -2

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复