Exact matching of text with a dataframe column in R

Submitted by 送分小仙女 on 2020-04-17 22:54:30

Question


I have a vector of words in R:

words = c("Awesome","Loss","Good","Bad")

And I have the following dataframe in R:

df <- data.frame(ID = c(1,2,3),
                 Response = c("Today is an awesome day", 
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"))

I want to extract the words that exactly match a word in the Response column and insert them into a new column of the dataframe. The final output should look like this:

ID   Response                                       Match
1    Today is an awesome day                        Awesome
2    Yesterday was a bad day,but today it is good   Bad,Good
3    I have losses today                            NA

I used the following code:

# extract the list of matching words
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))

# paste the matching words together
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

This returns matches, but not exact ones: for example, "Loss" also matches "losses". Please help.


Answer 1:


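If you use anchors in your words vector, you can force exact matches: ^ asserts that you're at the start of the string, $ that you're at the end of the string. Note that an anchored pattern only matches when the whole response is that word, which is enough here to stop "Loss" from matching "losses". So: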

words = c("Awesome","^Loss$","Good","Bad")

Then use your code:

x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

which gives:

> df
  ID                                     Response    Words
1  1                      Today is an awesome day  Awesome
2  2 Yesterday was a bad day,but today it is good Good,Bad
3  3                          I have losses today  

To turn blanks to NA:

df$Words[df$Words == ""] <- NA
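
A quick check of how an anchored pattern behaves compared with a word-boundary pattern, as a sketch with made-up input strings (the vector `s` is illustrative and not taken from the question):

```r
# Illustrative inputs: "loss" alone, as a whole word in a sentence, and inside "losses"
s <- c("loss", "a big loss today", "i have losses today")

# ^ and $ anchor the whole string, so only the first element matches
anchored <- grepl("^loss$", s)

# \\b marks a word boundary, so the whole word is also found inside a sentence
bounded <- grepl("\\bloss\\b", s)

print(anchored)  # TRUE FALSE FALSE
print(bounded)   # TRUE TRUE FALSE
```

Neither pattern matches "losses", but only the word-boundary version finds "loss" inside a longer response.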



Answer 2:


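We can use str_extract_all from stringr, with word boundaries (\\b) around the alternatives and the case-insensitive flag (?i):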

library(stringr)
library(dplyr)
library(purrr)
df %>%
    mutate(Words = map_chr(str_extract_all(Response,
       str_c("(?i)\\b(", str_c(words, collapse="|"), ")\\b")), toString))
#   ID                                     Response     Words
#1  1                      Today is an awesome day   awesome
#2  2 Yesterday was a bad day,but today it is good bad, good
#3  3                          I have losses today          

data

words <- c("Awesome","Loss","Good","Bad")
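
For comparison, roughly the same extraction can be sketched in base R with gregexpr() and regmatches(). This is an alternative to the stringr pipeline, not part of the answer. The helper `canonical` is a hypothetical name. Restoring the spelling used in `words` and returning NA when nothing matches are assumptions made to mirror the output requested in the question.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
df <- data.frame(ID = c(1, 2, 3),
                 Response = c("Today is an awesome day",
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"),
                 stringsAsFactors = FALSE)

# One case-insensitive pattern that matches any of the words as a whole word
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
hits <- regmatches(df$Response,
                   gregexpr(pattern, df$Response, ignore.case = TRUE))

# Hypothetical helper: restore the spelling used in `words`, NA when nothing matched
canonical <- function(found) {
  if (length(found) == 0) return(NA_character_)
  paste(words[match(tolower(found), tolower(words))], collapse = ",")
}

df$Match <- vapply(hits, canonical, character(1))
print(df$Match)
```

Matches come back in the order they appear in the text, so the second row reads "Bad,Good" as in the question's desired output.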



Answer 3:


Change the function in the first *apply call to a two-line function. If the regex becomes "\\bword\\b", it matches the word only when it is surrounded by word boundaries.

x <- sapply(words, function(x) {
  y <- paste0("\\b", x, "\\b")
  grepl(tolower(y), tolower(df$Response))
})

Now run the second apply as posted in the question.

df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

df
#  ID                                     Response    Words
#1  1                      Today is an awesome day  Awesome
#2  2 Yesterday was a bad day,but today it is good Good,Bad
#3  3                          I have losses today       

As for the NAs, I will use the replacement function is.na<-.

is.na(df$Words) <- df$Words == ""
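
The two steps can also be wrapped in a small function. The sketch below is a variation, not this answer's exact code: `match_words` is a hypothetical name, and it passes ignore.case = TRUE to grepl() instead of calling tolower() on the pattern. That assumes you would rather leave the regex untouched (lowercasing a pattern would, for example, turn \\B into \\b).

```r
# Hypothetical helper: for each text, list the words that occur as whole words
match_words <- function(text, words) {
  # Logical matrix: one row per text, one column per word
  found <- vapply(words, function(w) {
    grepl(paste0("\\b", w, "\\b"), text, ignore.case = TRUE)
  }, logical(length(text)))
  found <- matrix(found, nrow = length(text), dimnames = list(NULL, words))

  out <- apply(found, 1, function(i) paste(words[i], collapse = ","))
  is.na(out) <- out == ""   # blanks become NA, as above
  out
}

res <- match_words(c("Today is an awesome day",
                     "Yesterday was a bad day,but today it is good",
                     "I have losses today"),
                   c("Awesome", "Loss", "Good", "Bad"))
print(res)
```

The explicit matrix() call keeps the result two-dimensional even when only one text is passed, so apply() over rows still works.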

Data.

df <- read.table(text = "
ID           Response
1            'Today is an awesome day'
2            'Yesterday was a bad day,but today it is good'
3            'I have losses today'
", header = TRUE)

words <- c("Awesome","Loss","Good","Bad")


Source: https://stackoverflow.com/questions/61160426/exact-matching-text-with-dataframe-column-in-r
