Fast way to create a binary matrix with known number of 1 each row in R

忘了有多久 2021-01-28 23:10

I have a vector that gives the number of "1"s in each row of a matrix. I need to build that matrix from the vector.

For example, let's say I want to create a 4 x 9

4 Answers
  • 2021-01-28 23:57

    Here is my approach using sapply and do.call and some timings on a small sample.

    library(microbenchmark)
    library(Matrix)
    
    v <- c(2,6,3,9)
    microbenchmark(
      roman = {
        xy <- sapply(v, FUN = function(x, ncols) {
          c(rep(1, x), rep(0, ncols - x))
        }, ncols = 9, simplify = FALSE)
    
        xy <- do.call("rbind", xy)
      },
      fourtytwo = {
        t(vapply(v, function(y) { x <- numeric( length=9); x[1:y] <- 1;x}, numeric(9) ) )
      },
      akrun = {
        m1 <- sparseMatrix(i = rep(seq_along(v), v), j = sequence(v), x = 1)
        m1 <- as.matrix(m1)
      })
    
    Unit: microseconds
          expr      min        lq       mean    median       uq
         roman   26.436   30.0755   36.42011   36.2055   37.930
     fourtytwo   43.676   47.1250   55.53421   54.7870   57.852
         akrun 1261.634 1279.8330 1501.81596 1291.5180 1318.720
    

    And for a somewhat larger sample:

    v <- sample(2:9, size = 10e3, replace = TRUE)
    
    Unit: milliseconds
          expr      min       lq     mean   median       uq
         roman 33.52430 35.80026 37.28917 36.46881 37.69137
     fourtytwo 37.39502 40.10257 41.93843 40.52229 41.52205
         akrun 10.00342 10.34306 10.66846 10.52773 10.72638
    

    As the object grows, the benefits of sparseMatrix come to light.
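Timings like these are only comparable if the candidates build the same matrix, so it can be worth checking that first. The following is a minimal base-R sketch, assuming the 4 x 9 example input; the names `by_rbind` and `by_vapply` are made up for illustration, and `seq_len(y)` is swapped in for `1:y`.

```r
v <- c(2, 6, 3, 9)   # number of 1s in each row
n <- 9               # number of columns

# rep-and-rbind construction (the "roman" entry in the benchmark)
by_rbind <- do.call("rbind", lapply(v, function(x) c(rep(1, x), rep(0, n - x))))

# vapply construction (the "fourtytwo" entry in the benchmark)
by_vapply <- t(vapply(v, function(y) {
  x <- numeric(n)
  x[seq_len(y)] <- 1
  x
}, numeric(n)))

# TRUE when both constructions agree element by element
same <- isTRUE(all.equal(by_rbind, by_vapply))
```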

  • 2021-01-29 00:10

    Update on 2016-11-24

    I got another solution when answering Ragged rowSums in R today:

    outer(v, 1:9, ">=") + 0L
    
    #     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
    #[1,]    1    1    0    0    0    0    0    0    0
    #[2,]    1    1    1    1    1    1    0    0    0
    #[3,]    1    1    1    0    0    0    0    0    0
    #[4,]    1    1    1    1    1    1    1    1    1
    

    This has the same memory usage as the f function in my initial answer, and it won't be any slower than f. Consider the benchmark in my original answer:

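    microbenchmark(my_old = f(v, n), my_new = outer(v, seq_len(n), ">=") + 0L, unit = "ms")
    
    #Unit: milliseconds
    #   expr      min       lq        mean    median        uq       max neval cld
    # my_old 109.3422 111.0355 121.0382120 111.16752 112.44472 210.36808   100   b
    # my_new   0.3094   0.3199   0.3691904   0.39816   0.40608   0.45556   100  a 
    

    Note that these timings were recorded with outer(v, n, ">="), which compares v against the single value n and so builds only one column instead of all n. The my_new row therefore understates the cost of building the full matrix; rerun with seq_len(n) as written here before comparing the two. My old method was already the fastest among the existing solutions (see below).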


    Original answer on 2016-11-07

    Here is my "awkward" solution:

    f <- function (v, n) {
      # n <- 9    ## total number of column
      # v <- c(2,6,3,9)  ## number of 1 each row
      u <- n - v   ## number of 0 each row
      m <- length(u)  ## number of rows
      d <- rep.int(c(1,0), m)  ## discrete value for each row
      asn <- rbind(v, u) ## assignment of `d`
      fill <- rep.int(d, asn)  ## matrix elements
      matrix(fill, byrow = TRUE, ncol = n)
    }
    
    n <- 9    ## total number of column
    v <- c(2,6,3,9)  ## number of 1 each row
    
    f(v, n)
    #     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
    #[1,]    1    1    0    0    0    0    0    0    0
    #[2,]    1    1    1    1    1    1    0    0    0
    #[3,]    1    1    1    0    0    0    0    0    0
    #[4,]    1    1    1    1    1    1    1    1    1
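To see why f works, it may help to trace the intermediate vectors on a tiny input. This is only a walk-through of the same steps, assuming a hypothetical 2 x 4 case.

```r
v <- c(2, 3)                        # hypothetical input: two rows
n <- 4                              # four columns
u <- n - v                          # zeros per row: 2 1
d <- rep.int(c(1, 0), length(v))    # values to repeat: 1 0 1 0
asn <- rbind(v, u)                  # read column by column: 2 2 3 1
fill <- rep.int(d, asn)             # 1 1 0 0 1 1 1 0
m <- matrix(fill, byrow = TRUE, ncol = n)
m
#     [,1] [,2] [,3] [,4]
#[1,]    1    1    0    0
#[2,]    1    1    1    0
```

Because rbind(v, u) is stored column by column, the repeat counts alternate between "ones for row i" and "zeros for row i", which is exactly the row-major order that matrix(..., byrow = TRUE) expects.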
    

    Now consider a benchmark on a larger problem:

    n <- 500    ## 500 columns
    v <- sample.int(n, 10000, replace = TRUE)    ## 10000 rows
    
    microbenchmark(
      my_bad = f(v, n),
      roman = {
        xy <- sapply(v, FUN = function(x, ncols) {
          c(rep(1, x), rep(0, ncols - x))
        }, ncols = n, simplify = FALSE)
    
        do.call("rbind", xy)
      },
      fourtytwo = {
        t(vapply(v, function(y) { x <- numeric( length=n); x[1:y] <- 1;x}, numeric(n) ) )
      },
      akrun = {
        sparseMatrix(i = rep(seq_along(v), v), j = sequence(v), x = 1)
      },
      unit = "ms")
    
    #Unit: milliseconds
    #      expr      min       lq     mean   median       uq      max neval  cld
    #    my_bad 105.7507 118.6946 160.6818 138.5855 186.3762 327.3808   100 a   
    #     roman 176.9003 194.7467 245.0450 213.8680 305.9537 435.5974   100  b  
    # fourtytwo 235.0930 256.5129 307.3099 273.2280 358.8224 587.3256   100   c 
    #     akrun 316.7131 351.6184 408.5509 389.9576 456.0704 604.2667   100    d
    

    My method is in fact the fastest!!

  • 2021-01-29 00:11

    vapply is usually faster than sapply. This assigns the desired number of ones to a length-9 vector and then transposes.

    > t( vapply( c(2,6,3,9), function(y) { x <- numeric( length=9); x[1:y] <- 1;x}, numeric(9) ) )
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
    [1,]    1    1    0    0    0    0    0    0    0
    [2,]    1    1    1    1    1    1    0    0    0
    [3,]    1    1    1    0    0    0    0    0    0
    [4,]    1    1    1    1    1    1    1    1    1
    

    Building 100,000 rows of 500 columns takes less than 5 seconds on an old Mac.

    system.time( M <- t( vapply( sample(1:500, 100000, replace=TRUE), function(y) { x <- numeric( length=500); x[1:y] <- 1;x}, numeric(500) ) ) )
       user  system elapsed 
      3.531   1.208   4.676 
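One caveat with x[1:y]: if a count can be 0, then 1:0 evaluates to c(1, 0) and the first element is still set to 1. A hedged variant using seq_len avoids this; the helper `row_ones` and the input `v0` are hypothetical names for illustration.

```r
v0 <- c(0, 2, 4)   # hypothetical counts, including a zero
n <- 4

row_ones <- function(y, n) {
  x <- numeric(n)
  x[seq_len(y)] <- 1   # seq_len(0) is empty, so a zero count assigns nothing
  x
}

m <- t(vapply(v0, row_ones, numeric(n), n = n))

# For contrast, the 1:y form on a zero count sets the first element
bad <- numeric(n)
bad[1:0] <- 1
bad   # 1 0 0 0
```

If every count is known to be at least 1, as in the examples above, the two forms give the same result.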
    
  • 2021-01-29 00:13

    One option is sparseMatrix from the Matrix package:

    library(Matrix)
    m1 <- sparseMatrix(i = rep(seq_along(v), v), j = sequence(v), x = 1)
    m1
    #4 x 9 sparse Matrix of class "dgCMatrix"
    
    #[1,] 1 1 . . . . . . .
    #[2,] 1 1 1 1 1 1 . . .
    #[3,] 1 1 1 . . . . . .
    #[4,] 1 1 1 1 1 1 1 1 1
    

    This can be converted to a regular matrix with as.matrix:

    as.matrix(m1)
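When dims is not supplied, sparseMatrix infers the shape from the largest row and column index, so the result can be narrower than intended if no row is completely filled. A minimal sketch, assuming hypothetical counts whose maximum is below the desired number of columns:

```r
library(Matrix)

v <- c(2, 3, 1)   # hypothetical counts; the largest (3) is below n
n <- 5L           # desired number of columns

i <- rep(seq_along(v), v)   # row index of every 1
j <- sequence(v)            # column index of every 1

m_auto <- sparseMatrix(i = i, j = j, x = 1)
m_full <- sparseMatrix(i = i, j = j, x = 1, dims = c(length(v), n))

dim(m_auto)   # 3 3
dim(m_full)   # 3 5
```

With the 4 x 9 example above the issue does not show, because the last row has 9 ones.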
    