Question
I have data from an open-ended survey: a comments table and a codes table. Each column of the codes table is a theme, and its values are the words or strings that belong to that theme.
What I am trying to do: check whether any word or string from a given column of the codes table appears in an open-ended comment, then add a column to the comments table for that theme with a binary 1 or 0 to mark which records have been tagged.
The codes table has quite a number of columns, and it is live: the column order and the number of columns are subject to change.
I am currently doing this in a rather convoluted way, checking each column individually with several lines of code, and I reckon there is a much better way of doing it.
I can't figure out how to get lapply to work with the stringi function.
Help is greatly appreciated.
Here is an example set of code so you can see what I am trying to do:
#Two tables codes and comments
#codes table
codes <- structure(
list(
Support = structure(
c(2L, 3L, NA),
.Label = c("",
"help", "questions"),
class = "factor"
),
Online = structure(
c(1L,
3L, 2L),
.Label = c("activities", "discussion board", "quiz"),
class = "factor"
),
Resources = structure(
c(3L, 2L, NA),
.Label = c("", "pdf",
"textbook"),
class = "factor"
)
),
row.names = c(NA,-3L),
class = "data.frame"
)
#comments table
comments <- structure(
list(
SurveyID = structure(
1:5,
.Label = c("ID_1", "ID_2",
"ID_3", "ID_4", "ID_5"),
class = "factor"
),
Open_comments = structure(
c(2L,
4L, 3L, 5L, 1L),
.Label = c(
"I could never get the pdf to download",
"I didn’t get the help I needed on time",
"my questions went unanswered",
"staying motivated to get through the textbook",
"there wasn’t enough engagement in the discussion board"
),
class = "factor"
)
),
class = "data.frame",
row.names = c(NA,-5L)
)
#check if any words from the columns in codes table match comments
#here I am looking for a match column by column but looking for a better way - lapply?
library(stringi)
support = paste(codes$Support, collapse = "|")
supp_stringi = stri_detect_regex(comments$Open_comments, support)
supp_grepl = grepl(pattern = support, x = comments$Open_comments)
identical(supp_stringi, supp_grepl)
comments$Support = ifelse(supp_grepl == TRUE, 1, 0)
# What I would like to do is loop through all columns in codes rather than repeating the above code for each column in codes
Answer 1:
Here is an approach that uses stringi::stri_detect_regex() with lapply() to create vectors of 1 (TRUE) and 0 (FALSE), depending on whether any of the words in the Support, Online or Resources columns appear in each comment, and then binds those vectors back onto the comments table.
# build data structures from OP, then:
library(stringi)
resultsList <- lapply(codes, function(x) {
  # drop NA entries so the literal string "NA" does not end up in the pattern
  pattern <- paste(na.omit(x), collapse = "|")
  as.integer(stri_detect_regex(comments$Open_comments, pattern))
})
# lapply() keeps the column names of codes, so no renaming is needed
results <- as.data.frame(resultsList)
mergedData <- cbind(comments, results)
mergedData
...and the results.
> mergedData
  SurveyID                                          Open_comments Support Online Resources
1     ID_1                 I didn’t get the help I needed on time       1      0         0
2     ID_2          staying motivated to get through the textbook       0      0         1
3     ID_3                           my questions went unanswered       1      0         0
4     ID_4 there wasn’t enough engagement in the discussion board       0      1         0
5     ID_5                  I could never get the pdf to download       0      0         1
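A note on NA values: paste() converts NA to the literal string "NA", so a pattern built from codes$Support without removing the NAs is "help|questions|NA". None of the five example comments contains "NA", so the output above is unaffected, but a comment with those two capital letters would be tagged by mistake. The sketch below shows the effect in base R; test_comment is a made-up comment that is not part of the original data.

```r
# Same values as codes$Support in the question, as a plain character vector
support <- c("help", "questions", NA)

# paste() turns NA into the string "NA"; na.omit() removes it first
with_na    <- paste(support, collapse = "|")
without_na <- paste(na.omit(support), collapse = "|")

# Hypothetical comment (not from the original data) containing "NA" in capitals
test_comment <- "The NASA video was great"

hit_with_na    <- grepl(with_na, test_comment)     # matches on "NA"
hit_without_na <- grepl(without_na, test_comment)  # no theme word present
```

Dropping the NAs before pasting, as the base R answer does with na.omit(), avoids this.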
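Answer 2:
A one-liner using base R. na.omit() drops the NA entries before the pattern is built, and the unary + converts the logical result of grepl() to 1/0: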
comments[names(codes)] <- lapply(codes, function(x)
+(grepl(paste0(na.omit(x), collapse = "|"), comments$Open_comments)))
comments
#  SurveyID                                          Open_comments Support Online Resources
#1     ID_1                 I didn’t get the help I needed on time       1      0         0
#2     ID_2          staying motivated to get through the textbook       0      0         1
#3     ID_3                           my questions went unanswered       1      0         0
#4     ID_4 there wasn’t enough engagement in the discussion board       0      1         0
#5     ID_5                  I could never get the pdf to download       0      0         1
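Since the codes table is live, the same idea could be wrapped in a small function and rerun whenever the table changes. This is only a sketch under some assumptions: the name tag_comments and its text_col argument are hypothetical, and the codes_demo and comments_demo tables are shortened stand-ins for the question's data. It also guards against two edge cases: an empty string among the codes, or a column with no codes at all, produces an empty regex alternative, which matches every comment.

```r
# Hypothetical helper: tag each comment with 1/0 for every column in `codes`
tag_comments <- function(comments, codes, text_col = "Open_comments") {
  text <- as.character(comments[[text_col]])
  comments[names(codes)] <- lapply(codes, function(x) {
    terms <- as.character(x)
    # drop NA and empty strings; an empty alternative would match every comment
    terms <- terms[!is.na(terms) & nzchar(terms)]
    if (length(terms) == 0) {
      return(integer(length(text)))  # no terms: nothing is tagged
    }
    as.integer(grepl(paste(terms, collapse = "|"), text))
  })
  comments
}

# Small stand-in tables (shortened from the question's data)
codes_demo <- data.frame(
  Support   = c("help", "questions", NA),
  Resources = c("textbook", "", NA),
  Empty     = c(NA_character_, NA, NA),
  stringsAsFactors = FALSE
)
comments_demo <- data.frame(
  SurveyID = c("ID_1", "ID_2"),
  Open_comments = c("my questions went unanswered",
                    "staying motivated to get through the textbook"),
  stringsAsFactors = FALSE
)

tagged <- tag_comments(comments_demo, codes_demo)
```

Because the function loops over names(codes), it does not depend on the order or number of columns in the codes table.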
Source: https://stackoverflow.com/questions/61354445/matching-strings-loop-over-multiple-columns