Question
I am trying to find multiple strings in my data frame using the which function, extending the answer from "Find string in data.frame".
An example dataframe is:
df1 <- data.frame(animal=c('a','b','c','two', 'five', 'c'), level=c('five','one','three',30,'horse', 'five'), length=c(10, 20, 30, 'horse', 'eight', 'c'))
  animal level length
1      a  five     10
2      b   one     20
3      c three     30
4    two    30  horse
5   five horse  eight
6      c  five      c
When I apply the which function to this data frame for a single string, I get the correct output, e.g.
which(df1 =="c" , arr.ind = T);df1
gives:
row col
[1,] 3 1
[2,] 6 1
[3,] 6 3
But when I try to search for multiple strings, I get only partially correct output, e.g.
which(df1 ==c("c", "horse", "five") , arr.ind = T)
row col
[1,] 5 2
[2,] 6 2
The expected output should be:
row col
[1,] 3 1
[2,] 5 1
[3,] 6 1
[4,] 1 2
[5,] 5 2
[6,] 6 2
[7,] 4 3
[8,] 6 3
Hence my question:
Why does the solution with c("c", "horse", "five") not work?
I have tried with
which(df1=="c" | df1=="horse" | df1 =="five", arr.ind = T)
That gives me the correct output, but it becomes too lengthy for many strings. How can I make my code more succinct?
Answer 1:
We can loop through the vector with lapply, do the == comparison, Reduce the results to a single logical matrix with |, and wrap with which
which(Reduce(`|`, lapply(c("c", "horse", "five"), `==`, df1)), arr.ind = TRUE)
# row col
#[1,] 3 1
#[2,] 5 1
#[3,] 6 1
#[4,] 1 2
#[5,] 5 2
#[6,] 6 2
#[7,] 4 3
#[8,] 6 3
Another option is to loop through the columns of the dataset with mutate_all, convert the result to a matrix, and wrap with which
library(dplyr)
df1 %>%
  # TRUE where a cell exactly equals one of the values
  mutate_all(list(~ . %in% c("c", "horse", "five"))) %>%
  as.matrix %>%
  which(., arr.ind = TRUE)
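For readers without dplyr, the same column-wise %in% idea can be sketched in base R with sapply. This is a minimal sketch, not part of the original answer: the name vals is introduced here for illustration, and stringsAsFactors = FALSE is an added assumption so that the columns are plain character vectors on any R version.

```r
# Example data from the question (stringsAsFactors = FALSE is an assumption
# added here so the columns are character vectors on any R version)
df1 <- data.frame(animal = c('a', 'b', 'c', 'two', 'five', 'c'),
                  level  = c('five', 'one', 'three', 30, 'horse', 'five'),
                  length = c(10, 20, 30, 'horse', 'eight', 'c'),
                  stringsAsFactors = FALSE)

# Hypothetical name for the strings to search for
vals <- c("c", "horse", "five")

# Apply %in% to each column; sapply binds the logical vectors into a matrix
is_match <- sapply(df1, `%in%`, vals)

# Row and column indices of the matching cells
hits <- which(is_match, arr.ind = TRUE)
hits
```

Because sapply returns one logical vector per column, the result is already a logical matrix and no as.matrix step is needed.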
NOTE: Because the OP wants full-string matches, no regex or partial matching is needed here, and exact comparison should also be faster than partial matching.
Usually %in% is the natural choice for multiple elements, but it works element-wise only on a vector, not on a data.frame
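As for why df1 == c("c", "horse", "five") finds only two cells: == does not test each cell against every value. The vector is recycled down the columns, so each cell is compared with exactly one of the three strings. The sketch below makes the recycled values explicit. The names vals, recycled and hits_recycled are introduced here for illustration, and stringsAsFactors = FALSE is an added assumption.

```r
# Example data from the question (stringsAsFactors = FALSE is an assumption)
df1 <- data.frame(animal = c('a', 'b', 'c', 'two', 'five', 'c'),
                  level  = c('five', 'one', 'three', 30, 'horse', 'five'),
                  length = c(10, 20, 30, 'horse', 'eight', 'c'),
                  stringsAsFactors = FALSE)
vals <- c("c", "horse", "five")

# The value each cell is actually compared with: vals repeated
# column by column until all 6 x 3 cells are filled
recycled <- matrix(rep_len(vals, nrow(df1) * ncol(df1)), nrow = nrow(df1))
recycled

# Cell-by-cell comparison against the recycled matrix:
# only rows 5 and 6 of column 2 happen to line up with their value
hits_recycled <- which(as.matrix(df1) == recycled, arr.ind = TRUE)
hits_recycled
```

Row 3 of column 1 holds "c", for example, but it is compared with "five", so it is not reported.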
Answer 2:
Since you have multiple values, you cannot compare them directly against a data frame. One way is to use sapply with grepl: build a pattern with word boundaries, check whether it is present in each column, and then use which to get the row and column indices.
vals <- c("c", "horse", "five")
which(sapply(df1, grepl, pattern = paste0("\\b", vals, "\\b", collapse = "|")),
      arr.ind = TRUE)
# row col
#[1,] 3 1
#[2,] 5 1
#[3,] 6 1
#[4,] 1 2
#[5,] 5 2
#[6,] 6 2
#[7,] 4 3
#[8,] 6 3
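One caveat with word boundaries: they also match a value that appears as a whole word inside a longer cell. The sketch below compares the word-boundary pattern with a pattern anchored by ^ and $. The data frame df2 and the names boundary, anchored, loose_hits and exact_hits are hypothetical, made up here only to show the difference.

```r
# Hypothetical data with cells that merely contain one of the values
df2 <- data.frame(x = c("horse", "sea horse", "c"),
                  y = c("five", "c-1", "fives"),
                  stringsAsFactors = FALSE)
vals <- c("c", "horse", "five")

# Word-boundary pattern, built as in the answer above
boundary <- paste0("\\b", vals, "\\b", collapse = "|")

# Anchored pattern: the whole cell must equal one of the values
anchored <- paste0("^(", paste(vals, collapse = "|"), ")$")

# Also reports "sea horse" and "c-1", but not "fives"
loose_hits <- which(sapply(df2, grepl, pattern = boundary), arr.ind = TRUE)

# Reports only the cells that are exactly "horse", "c" or "five"
exact_hits <- which(sapply(df2, grepl, pattern = anchored), arr.ind = TRUE)

loose_hits
exact_hits
```

On the question's df1 both patterns give the same eight cells, since every cell there is a single word.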
Source: https://stackoverflow.com/questions/56583161/find-multiple-strings-in-entire-dataframe