Exact matching of text against a data frame column in R

Question

I have a vector of words in R:

words = c("Awesome","Loss","Good","Bad")

And I have the following dataframe in R:

df <- data.frame(ID = c(1,2,3),
                 Response = c("Today is an awesome day", 
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"))

I want to extract the words that match exactly in the Response column and insert them into a new column of the data frame. The final output should look like this:

ID   Response                                       Match
1    Today is an awesome day                        Awesome
2    Yesterday was a bad day,but today it is good   Bad,Good
3    I have losses today                            NA

I used the following code:

# extract the list of matching words
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))

# paste the matching words together
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

This finds matches, but not exact ones: for example, "Loss" also matches inside "losses". Please help.

Answer 1:

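If you use anchors in your words vector, you can rule out partial matches. Note that ^ asserts the start and $ the end of the whole string, not of a word, so ^Loss$ matches only when the entire response is "loss". That is enough to exclude "losses" in this example, but a standalone "loss" inside a longer sentence would not match either. So: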

words = c("Awesome","^Loss$","Good","Bad")

Then use your code:

x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

which gives:

> df
  ID                                     Response    Words
1  1                      Today is an awesome day  Awesome
2  2 Yesterday was a bad day,but today it is good Good,Bad
3  3                          I have losses today

To turn blanks to NA:

df$Words[df$Words == ""] <- NA
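
As a quick check on what these anchors do, the sketch below compares an anchored pattern with a word-boundary pattern. The `responses` vector is made up for illustration and is not the data from the question.

```r
# Illustrative strings (an assumption, not the question's data)
responses <- c("loss", "I have a loss today", "I have losses today")

# ^ and $ anchor the whole string: TRUE only when the entire string is "loss"
anchored <- grepl("^loss$", responses)

# \\b marks a word boundary: TRUE when "loss" occurs as a whole word
bounded <- grepl("\\bloss\\b", responses)

print(data.frame(responses, anchored, bounded))
```

Only the word-boundary pattern picks up "loss" in the middle of a sentence while still rejecting "losses".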

Answer 2:

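We can use str_extract_all from stringr with a case-insensitive ((?i)) pattern wrapped in word boundaries (\\b). Note that it returns the text as it appears in Response (lowercase here), not the entries of words: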

library(stringr)
library(dplyr)
library(purrr)
df %>%
    mutate(Words = map_chr(str_extract_all(Response,
        str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")), toString))
#   ID                                     Response     Words
#1  1                      Today is an awesome day   awesome
#2  2 Yesterday was a bad day,but today it is good bad, good
#3  3                          I have losses today

data

words <- c("Awesome","Loss","Good","Bad")
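
If the output should use the spelling from `words` and NA for rows without a match, the same idea can be sketched in base R. This is a minimal sketch rather than part of the answer above; the names `pattern`, `hits`, and `matched` are introduced here for illustration.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
responses <- c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today")

# One alternation wrapped in word boundaries: \\b(Awesome|Loss|Good|Bad)\\b
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# Extract every whole-word hit per response, ignoring case
hits <- regmatches(responses,
                   gregexpr(pattern, responses, ignore.case = TRUE, perl = TRUE))

# Map each hit back to its spelling in `words`; rows without hits become NA
matched <- vapply(hits, function(h) {
  if (length(h) == 0) return(NA_character_)
  toString(words[match(tolower(h), tolower(words))])
}, character(1))

print(matched)
```

The hits are listed in the order they occur in each response, so the second row reads "Bad, Good" rather than following the order of `words`.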

Answer 3:

Change the function in the first *apply call to a two-line function. If the regex becomes "\\bword\\b", it matches the word only when it is surrounded by word boundaries.

x <- sapply(words, function(x) {
  y <- paste0("\\b", x, "\\b")
  grepl(tolower(y), tolower(df$Response))
})

Now run the second apply as posted in the question.

df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

df
#  ID                                     Response    Words
#1  1                      Today is an awesome day  Awesome
#2  2 Yesterday was a bad day,but today it is good Good,Bad
#3  3                          I have losses today

As for the NAs, I will use the replacement function is.na<-, which sets the selected elements to NA.

is.na(df$Words) <- df$Words == ""

Data.

df <- read.table(text = "
ID           Response
1            'Today is an awesome day'
2            'Yesterday was a bad day,but today it is good'
3            'I have losses today'
", header = TRUE)

words <- c("Awesome","Loss","Good","Bad")
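
The two steps and the NA conversion can also be bundled into one function. The sketch below is an illustration only: `match_exact_words` is a hypothetical helper name, it uses `ignore.case = TRUE` instead of `tolower()`, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper (not from the answers): whole-word, case-insensitive matching
match_exact_words <- function(responses, words, sep = ",") {
  # One logical column per word; \\b restricts matches to whole words
  hits <- sapply(words, function(w) {
    grepl(paste0("\\b", w, "\\b"), responses, ignore.case = TRUE)
  })
  # Keep a matrix even when there is only a single response
  hits <- matrix(hits, nrow = length(responses), dimnames = list(NULL, words))
  # Paste the matching words of each row together, in the order of `words`
  out <- apply(hits, 1, function(i) paste0(words[i], collapse = sep))
  out[out == ""] <- NA  # rows without any match become NA
  out
}

responses <- c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today")
words <- c("Awesome", "Loss", "Good", "Bad")
result <- match_exact_words(responses, words)
print(result)
```

With the data frame from the question, this would be used as `df$Words <- match_exact_words(df$Response, words)`.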

Source: https://stackoverflow.com/questions/61160426/exact-matching-text-with-dataframe-column-in-r

Tags

exact-match