Suppose I have two matrices, each with two columns and differing numbers of rows. I want to check which row pairs of one matrix appear in the other. If these were one-column matrices, I could simply use %in%, but that does not work row-wise on matrices.
Recreate your data:
a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol=2, byrow=TRUE)
x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol=2, byrow=TRUE)
Define a function %inm% that is a matrix analogue to %in%:
`%inm%` <- function(x, matrix){
  # compare the vector x against every row of the matrix;
  # TRUE if any row matches element-for-element
  test <- apply(matrix, 1, `==`, x)
  any(apply(test, 2, all))
}
Apply this to your data:
apply(a, 1, `%inm%`, x)
[1] FALSE TRUE TRUE FALSE
To compare a single row:
a[1, ] %inm% x
[1] FALSE
a[2, ] %inm% x
[1] TRUE
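A base-R alternative worth knowing (my own sketch, not from the answer above): stack the two matrices and use duplicated(), which flags rows of a that already occurred in x. This assumes the rows of a are themselves unique, since duplicates within a would also be flagged:

```r
# Sketch: row membership via duplicated(); assumes the rows of `a` are unique
a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol = 2, byrow = TRUE)
x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol = 2, byrow = TRUE)

# rows of `a` that duplicate an earlier row of the stacked matrix,
# i.e. that already appear in `x`
tail(duplicated(rbind(x, a)), nrow(a))
# [1] FALSE  TRUE  TRUE FALSE
```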
Andrie's solution is perfectly fine. But if you have big matrices, you might want to try something else, based on recursion. If you work columnwise, you can cut down on the calculation time by excluding everything that doesn't match at the first position:
fastercheck <- function(x, matrix){
  nc <- ncol(matrix)
  # recursively narrow down the candidate rows, one column at a time;
  # `id` tracks which rows of `matrix` still match the query row `r`
  rec.check <- function(r, i, id){
    id[id] <- matrix[id, i] %in% r[i]
    if(i < nc && any(id)) rec.check(r, i + 1, id) else any(id)
  }
  apply(x, 1, rec.check, 1, rep(TRUE, nrow(matrix)))
}
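Before the big benchmark, a quick sanity check on the toy data from the first answer (the definition is repeated here so the snippet runs standalone):

```r
# definition repeated from above so this snippet is self-contained
fastercheck <- function(x, matrix){
  nc <- ncol(matrix)
  rec.check <- function(r, i, id){
    id[id] <- matrix[id, i] %in% r[i]
    if(i < nc && any(id)) rec.check(r, i + 1, id) else any(id)
  }
  apply(x, 1, rec.check, 1, rep(TRUE, nrow(matrix)))
}

a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol = 2, byrow = TRUE)
x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol = 2, byrow = TRUE)

fastercheck(a, x)
# [1] FALSE  TRUE  TRUE FALSE
```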
The comparison:
> set.seed(100)
> x <- matrix(runif(1e6),ncol=10)
> a <- matrix(runif(300),ncol=10)
> a[c(3,7,9,15),] <- x[c(1000,48213,867,20459),]
> system.time(res1 <- apply(a, 1, `%inm%`, x))
user system elapsed
31.16 0.14 31.50
> system.time(res2 <- fastercheck(a,x))
user system elapsed
0.37 0.00 0.38
> identical(res1, res2)
[1] TRUE
> which(res2)
[1] 3 7 9 15
EDIT:
I checked the accepted answer just for fun. It performs better than the double apply (as you get rid of the inner loop), but recursion still rules! ;-)
> system.time(apply(a, 1, paste, collapse="$$") %in%
+ apply(x, 1, paste, collapse="$$"))
user system elapsed
6.40 0.01 6.41
Coming in late to the game: I had previously written an algorithm using the "paste with delimiter" method, and then found this page. I was guessing that one of the code snippets here would be the fastest, but:
andrie <- function(mfoo, nfoo) apply(mfoo, 1, `%inm%`, nfoo)
# using Andrie's %inm% operator exactly as above

carl <- function(mfoo, nfoo) {
  # build one delimited string per row, then use a set operation
  allrows <- sapply(1:nrow(mfoo), function(j) paste(mfoo[j, ], collapse = '_'))
  allfoo  <- sapply(1:nrow(nfoo), function(j) paste(nfoo[j, ], collapse = '_'))
  thewalls <- setdiff(allrows, allfoo)
  # note: this returns the rows of mfoo *not* found in nfoo,
  # rather than a logical vector
  dowalls <- mfoo[allrows %in% thewalls, ]
}
ramnath <- function(a, x) apply(a, 1, digest) %in% apply(x, 1, digest)  # requires library(digest)
mfoo<-matrix( sample(1:100,400,rep=TRUE),nr=100)
nfoo<-mfoo[sample(1:100,60),]
library(microbenchmark)
microbenchmark(andrie(mfoo,nfoo),carl(mfoo,nfoo),ramnath(mfoo,nfoo),times=5)
Unit: milliseconds
expr min lq median uq max neval
andrie(mfoo, nfoo) 25.564196 26.527632 27.964448 29.687344 102.802004 5
carl(mfoo, nfoo) 1.020310 1.079323 1.096855 1.193926 1.246523 5
ramnath(mfoo, nfoo) 8.176164 8.429318 8.539644 9.258480 9.458608 5
So apparently constructing character strings and doing a single set operation is fastest! (PS: I checked, and all three approaches identify the same matching rows, although carl() returns the non-matching rows of mfoo themselves rather than a logical vector.)
Another approach would be:
> paste(a[,1], a[,2], sep="$$") %in% paste(x[,1], x[,2], sep="$$")
[1] FALSE TRUE TRUE FALSE
A more general version of this is:
> apply(a, 1, paste, collapse="$$") %in% apply(x, 1, paste, collapse="$$")
[1] FALSE TRUE TRUE FALSE
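The separator matters here: without a distinctive delimiter, different rows can collapse into the same string. A small illustration (my own example, not from the original answer):

```r
# rows (11, 2) and (1, 12) collide when pasted without a separator
paste(11, 2, sep = "")    # "112"
paste(1, 12, sep = "")    # "112"

# a delimiter unlikely to occur in the data keeps them distinct
paste(11, 2, sep = "$$")  # "11$$2"
paste(1, 12, sep = "$$")  # "1$$12"
```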
Here is another approach using the digest package, creating a checksum for each row with a hashing algorithm (the default being md5):
library(digest)

a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol=2, byrow=TRUE)
x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol=2, byrow=TRUE)
apply(a, 1, digest) %in% apply(x, 1, digest)
[1] FALSE TRUE TRUE FALSE
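If you also need to know which row of x each row of a matches (not just whether it matches), match() works on any per-row key. A sketch using the paste keys from the accepted answer (the digest checksums would work identically):

```r
a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol = 2, byrow = TRUE)
x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol = 2, byrow = TRUE)

# index of the matching row in x, or NA when there is none
match(apply(a, 1, paste, collapse = "$$"),
      apply(x, 1, paste, collapse = "$$"))
# [1] NA  4  1 NA
```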