Question
In order to find out whether data frame df.a is a subset of data frame df.b, I did the following:
df.a <- data.frame( x=1:5, y=6:10 )
df.b <- data.frame( x=1:7, y=6:12 )
inds.x <- as.integer( lapply( df.a$x, function(x) which(df.b$x == x) ))
inds.y <- as.integer( lapply( df.a$y, function(y) which(df.b$y == y) ))
identical( inds.x, inds.y )
The last line gave TRUE, hence df.a is contained in df.b.
Now I wonder whether there is a more elegant - and possibly more efficient - way to answer this question?
This task can also easily be extended to finding the intersection of two given data frames, possibly based on only a subset of columns.
Help will be much appreciated.
Answer 1:
I am going to hazard a guess at an answer.
I think semi_join from dplyr will do what you want, even taking into account duplicated rows.
First, note the help file ?semi_join:
return all rows from x where there are matching values in y, keeping just columns from x.
A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
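To make that difference concrete, here is a small sketch. The data frames x and y and the column k are made up for illustration; inner_join and semi_join are the dplyr functions discussed above.

```r
library(dplyr)

# Toy data (hypothetical): key 1 occurs twice in y.
x <- data.frame(k = c(1, 2, 3))
y <- data.frame(k = c(1, 1, 2))

# An inner join returns one row of x per matching row of y,
# so the x row with k = 1 shows up twice.
inner <- inner_join(x, y, by = "k")

# A semi join only filters x; it never duplicates rows of x.
semi <- semi_join(x, y, by = "k")

nrow(inner)  # 3
nrow(semi)   # 2
```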
Ok, this suggests that the following should correctly fail:
library(dplyr)
df.a <- data.frame( x=c(1:5,1), y=c(6:10,6) )  # row (1, 6) appears twice
df.b <- data.frame( x=1:7, y=6:12 )            # row (1, 6) appears only once
identical(semi_join(df.b, df.a), semi_join(df.a, df.a))
which gives FALSE, as expected, since
> semi_join(df.b, df.a)
Joining by: c("x", "y")
x y
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
However, the following should pass:
df.c <- data.frame( x=c(1:7, 1), y= c(6:12, 6) )
identical(semi_join(df.c, df.a), semi_join(df.a, df.a))
and it does, giving TRUE.
The second semi_join(df.a, df.a) is required to get the canonical sorting on df.a.
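If you would rather not depend on dplyr, here is a rough base-R sketch of the same idea that also covers the "subset of columns" case from the question. The helper names row_keys, is_contained and rows_in_common are made up for illustration, and the technique (pasting the selected columns into one string key per row and testing with %in%) is my own substitution, not what semi_join does internally. Note that it ignores how often a row occurs, so unlike the identical() check above it would treat a duplicated row in df.a as contained even if df.b has that row only once.

```r
df.a <- data.frame(x = 1:5, y = 6:10)
df.b <- data.frame(x = 1:7, y = 6:12)

# Hypothetical helper: build one string key per row from the chosen columns.
# Assumption: values compare sensibly as strings (fine for integers and
# characters, fragile for floating-point numbers).
row_keys <- function(df, cols) {
  do.call(paste, c(df[cols], sep = "\r"))
}

# Hypothetical helper: TRUE if every row of `a`, restricted to `cols`,
# occurs at least once in `b`. Multiplicity of rows is ignored.
is_contained <- function(a, b, cols = intersect(names(a), names(b))) {
  all(row_keys(a, cols) %in% row_keys(b, cols))
}

# Hypothetical helper: distinct rows of `a` (on `cols`) that also occur in `b`.
rows_in_common <- function(a, b, cols = intersect(names(a), names(b))) {
  keep <- row_keys(a, cols) %in% row_keys(b, cols)
  unique(a[keep, cols, drop = FALSE])
}

is_contained(df.a, df.b)              # TRUE
is_contained(df.b, df.a)              # FALSE
is_contained(df.a, df.b, cols = "x")  # TRUE, compares column x only
rows_in_common(df.b, df.a)            # the five rows shared by both
```

For large data frames a join-based approach such as semi_join is likely to scale better than string keys; this is only meant to show the logic.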
Source: https://stackoverflow.com/questions/29356496/r-how-to-efficiently-find-out-whether-data-frame-a-is-contained-in-data-frame-b