How to take a word and create an indicator variable based on the word's presence in comments?

Submitted by 喜你入骨 on 2019-12-12 18:01:43

Question


I have a vector of words and a vector of comments:

word.list <- c("very", "experience", "glad")

comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

I would like to create a data frame that looks like

df <- data.frame(comments = c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad"),
               very = c(1,0,0,0,1),
               glad = c(0,1,0,0,1),
               experience = c(1,0,0,1,0))

I have 12,000+ comments and 20 words I would like to do this with. How do I go about doing this efficiently? For loops? Any other method?


Answer 1:


Loop through word.list and use grepl:

sapply(word.list, function(i) as.numeric(grepl(i, comments)))

To have pretty output, convert to a dataframe:

data.frame(comments, sapply(word.list, function(i) as.numeric(grepl(i, comments))))

Note: grepl matches substrings, so "very" will also match "veryX". If this is not desired, restrict the pattern to complete words with word boundaries.

# To avoid matching "very" with "veryX"
sapply(word.list, function(i) as.numeric(grepl(paste0("\\b", i, "\\b"), comments)))
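
The difference between the two patterns is easiest to see on a word that contains "very" as a substring, such as "every". The following is a minimal sketch; the demo.comments vector and the variable names are made up for illustration and are not part of the question's data.

```r
# Hypothetical comments, chosen so that "very" also occurs inside "every".
demo.comments <- c("every visit was fine", "very glad I came")

# Plain substring match: "very" is found inside "every".
substring.hit <- grepl("very", demo.comments)

# Whole-word match: \\b anchors the pattern at word boundaries.
whole.word.hit <- grepl("\\bvery\\b", demo.comments)

# ignore.case = TRUE additionally matches "Very" or "VERY".
any.case.hit <- grepl("\\bvery\\b", c("Very good", "every time"), ignore.case = TRUE)
```

Here substring.hit flags both comments, while whole.word.hit flags only the second one. Whether case should be ignored depends on the data; the comments in the question are mostly lower case already.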



Answer 2:


One way is to combine the stringi and qdapTools packages, i.e.

library(stringi)
library(qdapTools)

mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|')))
#  experience glad very
#1          1    0    1
#2          0    1    0
#3          0    0    0
#4          1    0    0
#5          0    1    1

You can then use cbind or data.frame to bind the result to the comments:

cbind(comments, mtabulate(stri_extract_all(comments, regex = paste(word.list, collapse = '|'))))
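
The same extract-then-tabulate idea can be sketched in base R if installing extra packages is not an option. This is not the answerer's code: the names pattern, hits, counts, indicators and result are illustrative, and the word boundaries are an added assumption. As far as I know mtabulate reports frequency counts, so a word that appears twice in one comment would show up as 2; the sketch below collapses counts to 0/1 explicitly.

```r
word.list <- c("very", "experience", "glad")
comments  <- c("very good experience. first time I have been and I would definitely come back.",
               "glad I scheduled an appointment.",
               "the staff have become more cordial.",
               "the experience i had was not good at all.",
               "i am very glad")

# One alternation pattern for all words, anchored at word boundaries.
pattern <- paste0("\\b(", paste(word.list, collapse = "|"), ")\\b")

# Extract every match per comment (a list of character vectors).
hits <- regmatches(comments, gregexpr(pattern, comments, perl = TRUE))

# Count how often each word was extracted from each comment
# (rows = comments, columns = words).
counts <- vapply(word.list,
                 function(w) vapply(hits, function(h) sum(h == w), numeric(1)),
                 numeric(length(comments)))

# Collapse counts to 0/1 presence indicators.
indicators <- (counts > 0) * 1

result <- data.frame(comments, indicators)
```

The columns of result follow the order of word.list rather than the alphabetical order that mtabulate prints.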



Answer 3:


Using base R, this code loops over the words and, for each word, over the comments. Each comment is split on spaces, periods and commas, and the word is checked for membership among the resulting tokens. The per-word results are then combined into a data frame:

df <- as.data.frame(do.call(cbind,lapply(word.list,function(w) 
          as.numeric(sapply(comments,function(v) w %in% unlist(strsplit(v,"[ \\.,]")))))))
names(df) <- word.list
df <- cbind(comments,df)

df
                                                                        comments very experience glad
1 very good experience. first time I have been and I would definitely come back.    1          1    0
2                                               glad I scheduled an appointment.    0          0    1
3                                            the staff have become more cordial.    0          0    0
4                                      the experience i had was not good at all.    0          1    0
5                                                                 i am very glad    1          0    1
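
Because the split pattern above only covers spaces, periods and commas, a token such as "glad!" or "Very" would not match. A possible variation, sketched here under the assumption that matching should ignore case and all punctuation, lower-cases each comment and splits on any run of non-alphanumeric characters. The demo.comments vector and the other names are hypothetical.

```r
word.list <- c("very", "experience", "glad")

# Hypothetical comments with capitals and punctuation other than "." and ",".
demo.comments <- c("Very good!", "so glad; great experience?", "everything was fine")

# Lower-case each comment and split on runs of non-alphanumeric characters.
tokens <- strsplit(tolower(demo.comments), "[^[:alnum:]]+")

# For each word, check membership in each comment's token set
# (rows = comments, columns = words).
indicators <- vapply(word.list,
                     function(w) vapply(tokens, function(tk) as.numeric(w %in% tk), numeric(1)),
                     numeric(length(demo.comments)))

demo.df <- data.frame(comments = demo.comments, indicators)
```

Since whole tokens are compared, "everything" does not count as a hit for "very".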


Source: https://stackoverflow.com/questions/43658614/how-to-take-a-word-and-create-an-indicator-variable-based-on-the-words-presence
