matrix %in% matrix

北恋 2020-12-08 23:24

Suppose I have two matrices, each with two columns and differing numbers of rows. I want to check which pairs (rows) of one matrix are in the other matrix. If these were one

5 Answers
  • 2020-12-08 23:29

    Recreate your data:

    a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol=2, byrow=TRUE)
    x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol=2, byrow=TRUE)
    

    Define a function %inm% that is a matrix analogue to %in%:

    `%inm%` <- function(x, matrix){
      test <- apply(matrix, 1, `==`, x)
      any(apply(test, 2, all))
    }
    

    Apply this to your data:

    apply(a, 1, `%inm%`, x)
    [1] FALSE  TRUE  TRUE FALSE
    

    To compare a single row:

    a[1, ] %inm% x
    [1] FALSE
    
    a[2, ] %inm% x
    [1] TRUE
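
    To see why the operator works, it can help to look at the intermediate values for a single lookup. This is only an illustrative walk-through of the two `apply` steps above; the names `r` and `hits` are introduced here and are not part of the original answer.

    ```r
    x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol = 2, byrow = TRUE)
    r <- c(4, 9)  # the row being looked up (hypothetical name)

    # Step 1: compare r against every row of x.
    # The result has one column per row of x.
    test <- apply(x, 1, `==`, r)
    dim(test)     # 2 5

    # Step 2: a column that is all TRUE marks a row of x equal to r.
    hits <- apply(test, 2, all)
    which(hits)   # 4, i.e. the fourth row of x is c(4, 9)
    any(hits)     # TRUE, the value %inm% returns for this row
    ```

    Note that the comparison uses `==`, so it assumes the matrices contain no `NA` values.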
    
  • 2020-12-08 23:31

    Andrie's solution is perfectly fine. But if you have big matrices, you might want to try something else, based on recursion. If you work columnwise, you can cut down on the calculation time by excluding everything that doesn't match at the first position:

    fastercheck <- function(x,matrix){
      nc <- ncol(matrix)
      rec.check <- function(r,i,id){
        id[id] <- matrix[id,i] %in% r[i]
        if(i<nc && any(id)) rec.check(r,i+1,id) else any(id)
      }
      apply(x,1,rec.check,1,rep(TRUE,nrow(matrix)))
    }
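
    The same column-wise pruning idea can also be written as a plain loop, which may be easier to follow than the recursion. This is a minimal sketch, not the code that was benchmarked in this answer; `rows_in_pruned` is a hypothetical name, and the sketch assumes neither matrix contains `NA`.

    ```r
    # Hypothetical helper: loop-based variant of the column-wise pruning idea.
    rows_in_pruned <- function(a, x) {
      apply(a, 1, function(r) {
        candidates <- seq_len(nrow(x))   # rows of x still in the running
        for (j in seq_len(ncol(x))) {
          # keep only the candidates that also match in column j
          candidates <- candidates[x[candidates, j] == r[j]]
          if (length(candidates) == 0L) return(FALSE)  # nothing left to match
        }
        TRUE                             # at least one row matched in every column
      })
    }

    # The small example data from the first answer
    a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol = 2, byrow = TRUE)
    x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol = 2, byrow = TRUE)
    rows_in_pruned(a, x)
    # [1] FALSE  TRUE  TRUE FALSE
    ```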
    

    The comparison:

    > set.seed(100)
    > x <- matrix(runif(1e6),ncol=10)
    > a <- matrix(runif(300),ncol=10)
    > a[c(3,7,9,15),] <- x[c(1000,48213,867,20459),]
    > system.time(res1 <- apply(a, 1, `%inm%`, x))
       user  system elapsed 
      31.16    0.14   31.50 
    > system.time(res2 <- fastercheck(a,x))
       user  system elapsed 
       0.37    0.00    0.38 
    > identical(res1, res2)
    [1] TRUE
    > which(res2)
    [1]  3  7  9 15
    

    EDIT:

    I checked the accepted answer just for fun. It performs better than the double apply (as you get rid of the inner loop), but recursion still rules! ;-)

    > system.time(apply(a, 1, paste, collapse="$$") %in% 
     + apply(x, 1, paste, collapse="$$"))
       user  system elapsed 
       6.40    0.01    6.41 
    
  • 2020-12-08 23:42

    Coming in late to the game: I had previously written an algorithm using the "paste with delimiter" method, and then found this page. I was guessing that one of the code snippets here would be the fastest, but:

    andrie<-function(mfoo,nfoo) apply(mfoo, 1, `%inm%`, nfoo)
    # using Andrie's %inm% operator exactly as above
    carl<-function(mfoo,nfoo) {
     allrows<-unlist(sapply(1:nrow(mfoo),function(j) paste(mfoo[j,],collapse='_'))) 
     allfoo <- unlist(sapply(1:nrow(nfoo),function(j) paste(nfoo[j,],collapse='_')))
     thewalls<-setdiff(allrows,allfoo)
     dowalls<-mfoo[allrows%in%thewalls,]
    }
    
     ramnath <- function (a,x) apply(a, 1, digest) %in% apply(x, 1, digest)
    
     mfoo<-matrix( sample(1:100,400,rep=TRUE),nr=100)
     nfoo<-mfoo[sample(1:100,60),]
    
     library(digest)          # provides digest(), used by ramnath()
     library(microbenchmark)
     microbenchmark(andrie(mfoo,nfoo),carl(mfoo,nfoo),ramnath(mfoo,nfoo),times=5)
    
    Unit: milliseconds
                    expr       min        lq    median        uq        max neval
      andrie(mfoo, nfoo) 25.564196 26.527632 27.964448 29.687344 102.802004     5
        carl(mfoo, nfoo)  1.020310  1.079323  1.096855  1.193926   1.246523     5
     ramnath(mfoo, nfoo)  8.176164  8.429318  8.539644  9.258480   9.458608     5
    

    So apparently constructing character strings and doing a single set operation is fastest! (PS I checked and all 3 algorithms give the same result)

  • 2020-12-08 23:44

    Another approach would be:

    > paste(a[,1], a[,2], sep="$$") %in% paste(x[,1], x[,2], sep="$$")
    [1] FALSE  TRUE  TRUE FALSE
    

    A more general version of this is:

    > apply(a, 1, paste, collapse="$$") %in% apply(x, 1, paste, collapse="$$")
    [1] FALSE  TRUE  TRUE FALSE
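
    As a possible refinement, the row keys can be built without a per-row `apply` by handing the columns to `paste` in one call. The helper name `row_keys` is hypothetical. The second half of the sketch shows a caveat of any paste-based key: two different rows can produce the same string if the delimiter occurs in the data.

    ```r
    a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol = 2, byrow = TRUE)
    x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol = 2, byrow = TRUE)

    # Hypothetical helper: one string key per row, pasted column-wise.
    row_keys <- function(m, sep = "$$") {
      do.call(paste, c(as.data.frame(m), sep = sep))
    }

    row_keys(a) %in% row_keys(x)
    # [1] FALSE  TRUE  TRUE FALSE

    # Caveat: keys collide when the delimiter appears inside the values.
    p <- matrix(c("a$$b", "c"), ncol = 2)
    q <- matrix(c("a", "b$$c"), ncol = 2)
    row_keys(p) == row_keys(q)   # TRUE, although the rows differ
    ```

    For numeric matrices like the ones in the question this collision cannot happen, but for character data it is worth choosing a delimiter that is known not to occur in the values.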
    
  • 2020-12-08 23:50

    Here is another approach, using the digest package to create a checksum for each row; the checksums are generated with a hashing algorithm (md5 by default).

    library(digest)
    a <- matrix(c(1, 2, 4, 9, 1, 6, 7, 7), ncol=2, byrow=TRUE)
    x <- matrix(c(1, 6, 2, 7, 3, 8, 4, 9, 5, 10), ncol=2, byrow=TRUE)
    apply(a, 1, digest) %in% apply(x, 1, digest)
    
    [1] FALSE  TRUE  TRUE FALSE
    