dplyr filtering on multiple columns using “%in%”

Question

I have a dataframe (df1) with multiple columns (ID, Number, Location, Field, Weight). I also have another dataframe (df2) with more information (ID, PassRate, Number, Weight).

I am trying to use dplyr and %in% to filter out rows in df1 that have the same two values as df2.

So far I have:

df_sub <- subset(df1, df1$ID %in% df2$ID & df1$Weight %in% df2$Weight)

But this is only subsetting on the first condition...any idea why?

Answer 1:

From the question and sample code, it is unclear whether you want df_sub to contain the rows in df1 that have matches in df2, or the rows without matches. dplyr::semi_join() returns the rows with matches; dplyr::anti_join() returns the rows without matches.

df_sub <- semi_join(x=df1, y=df2, by=c("ID","Weight"))

df_sub <- anti_join(x=df1, y=df2, by=c("ID","Weight"))
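
To make the difference between the two joins concrete, here is a minimal sketch. The sample data is hypothetical (it is not from the question), and it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical sample data, invented for illustration
df1 <- data.frame(ID = c(1, 2, 3), Weight = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c(1, 2, 9), Weight = c("a", "x", "c"),
                  stringsAsFactors = FALSE)

# Rows of df1 whose (ID, Weight) pair also occurs in some row of df2
matched <- semi_join(df1, df2, by = c("ID", "Weight"))

# Rows of df1 whose (ID, Weight) pair occurs in no row of df2
unmatched <- anti_join(df1, df2, by = c("ID", "Weight"))
```

Only the row with ID 1 and Weight "a" ends up in matched. ID 2 exists in df2 but with a different Weight, and Weight "c" exists in df2 but with a different ID, so both of those rows land in unmatched.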

Answer 2:

Try this:

df1[paste0(df1$ID, df1$Weight) %in% paste0(df2$ID, df2$Weight), ]

Your code filters df1 by the values that appear anywhere in each df2 column; it does not check that both values match within the same row.

Consider this sample data:

df1 
ID  Weight
1   a
2   b


df2 
ID  Weight
1   b
2   a

Using your code:

 df_sub <- subset(df1, df1$ID %in% df2$ID & df1$Weight %in% df2$Weight)


> df_sub
  ID Weight
1  1      a
2  2      b

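Each %in% test is evaluated per column and returns TRUE for every row, because every value in df1 also appears somewhere in the corresponding column of df2. As a result, all rows of df1 are kept:

 TRUE  TRUE
 TRUE  TRUE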
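
The two membership tests can be inspected separately on the sample data above. This is a small base R sketch; the variable names id_hit, weight_hit and keep are made up for illustration.

```r
# Sample data from this answer
df1 <- data.frame(ID = c(1, 2), Weight = c("a", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(ID = c(1, 2), Weight = c("b", "a"), stringsAsFactors = FALSE)

# Each test asks "does this value occur anywhere in the df2 column?"
id_hit     <- df1$ID %in% df2$ID          # both IDs occur in df2$ID
weight_hit <- df1$Weight %in% df2$Weight  # both weights occur in df2$Weight

# Combining them with & never compares values row against row
keep <- id_hit & weight_hit
df_sub <- df1[keep, ]
```

Both vectors are all TRUE, so keep is all TRUE and df_sub equals df1, even though no row of df1 has the same (ID, Weight) pair as a row of df2.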

With my version, no rows match:

 df1[paste0(df1$ID, df1$Weight) %in% paste0(df2$ID, df2$Weight), ]

[1] ID     Weight
<0 rows> (or 0-length row.names)
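
One caveat with pasted keys, not raised in the answer itself: paste0() joins values with no separator, so different pairs can collapse into the same string. The sketch below uses hypothetical numeric data to show the collision. It adds a separator that is assumed not to occur in the data; the choice of "\037" (the ASCII unit separator) is an illustration, not a requirement.

```r
# Hypothetical data chosen to collide: (1, 23) and (12, 3) both paste to "123"
df1 <- data.frame(ID = c(1, 12), Weight = c(23, 3))
df2 <- data.frame(ID = 12, Weight = 3)

# Without a separator, the first row of df1 is wrongly reported as a match
loose <- paste0(df1$ID, df1$Weight) %in% paste0(df2$ID, df2$Weight)

# With a separator that does not appear in the data, the columns stay distinct
sep_char <- "\037"
strict <- paste(df1$ID, df1$Weight, sep = sep_char) %in%
  paste(df2$ID, df2$Weight, sep = sep_char)

df1[strict, ]  # keeps only the row with ID 12 and Weight 3
```

Here loose is TRUE for both rows while strict is TRUE only for the second row. A join on both columns, as in the first answer, avoids the issue entirely because it compares the columns without building a string key.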

Source: https://stackoverflow.com/questions/45623451/dplyr-filtering-on-multiple-columns-using-in

Tags

statistics

dplyr