Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

忘了有多久 2020-12-02 19:28

I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understandi

4 answers
  • 2020-12-02 19:29

    rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.

    # Use data from Josh O'Brien's post.
    set.seed(21)
    X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
    system.time({
    Names <- names(X[[1]])  # Get data.frame names from first list element.
    # For each name, extract its values from each data.frame in the list.
    # This provides a list with an element for each name.
    Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
    names(Xb) <- Names          # Give Xb the correct names.
    Xb.df <- as.data.frame(Xb)  # Convert Xb to a data.frame.
    })
    #    user  system elapsed 
    #   3.356   0.024   3.388 
    system.time(X1 <- do.call(rbind, X))
    #    user  system elapsed 
    # 169.627   6.680 179.675
    identical(X1,Xb.df)
    # [1] TRUE
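
    The column-wise steps above can be wrapped in a reusable function. The following is only a sketch: `bind_by_column` is a hypothetical name, and it assumes every data.frame in the list has the same column names in the same order (it stops if that is not the case).

    ```r
    # Hypothetical helper (not from the original answer): bind a list of
    # data.frames column by column instead of calling rbind.data.frame.
    bind_by_column <- function(dfs) {
      nms <- names(dfs[[1]])
      # Assumption: all data.frames share the same column names, in order.
      same <- vapply(dfs, function(d) identical(names(d), nms), logical(1))
      stopifnot(all(same))
      # For each column name, pull that column from every data.frame and
      # concatenate the pieces into one long vector.
      cols <- lapply(nms, function(nm) {
        unlist(lapply(dfs, `[[`, nm), use.names = FALSE)
      })
      names(cols) <- nms
      as.data.frame(cols, stringsAsFactors = FALSE)
    }

    small <- list(data.frame(a = c(1.5, 2.5), b = 1:2),
                  data.frame(a = 3.5, b = 3L))
    res <- bind_by_column(small)
    str(res)
    ```

    The name check costs one pass over the list, which is cheap next to the binding itself.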
    

    Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)

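    # My "rbind list" function
    rbl.ju <- function(x) {
      # Flatten one level: a list of columns whose names repeat
      # (a, b, a, b, ... for the data above; assumes the outer list is unnamed)
      u <- unlist(x, recursive=FALSE)
      n <- names(u)
      un <- unique(n)
      # For each column name, concatenate all the pieces that share it
      l <- lapply(un, function(N) unlist(u[N==n], FALSE, FALSE))
      names(l) <- un
      as.data.frame(l)  # return the result visibly
    }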
    # simple wrapper to rbindlist that returns a data.frame
    rbl.dt <- function(x) {
      as.data.frame(rbindlist(x))
    }
    
    library(data.table)
    if(packageVersion("data.table") >= '1.8.2') {
      system.time(dt <- rbl.dt(X))  # rbindlist only exists in recent versions
    }
    #    user  system elapsed 
    #    0.02    0.00    0.02
    system.time(ju <- rbl.ju(X))
    #    user  system elapsed 
    #    0.05    0.00    0.05 
    identical(dt,ju)
    # [1] TRUE
    
  • 2020-12-02 19:36

    Your observation that the time taken increases exponentially with the number of data.frames suggests that breaking the rbinding into two stages could speed things up.

    This simple experiment seems to confirm that that's a very fruitful path to take:

    ## Make a list of 50,000 data.frames
    X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
    
    ## First, rbind together all 50,000 data.frames in a single step
    system.time({
        X1 <- do.call(rbind, X)
    })
    #    user  system elapsed 
    # 137.08   57.98  200.08 
    
    
    ## Doing it in two stages cuts the processing time by >95%
    ##   - In Stage 1, 100 groups of 500 data.frames are rbind'ed together
    ##   - In Stage 2, the resultant 100 data.frames are rbind'ed
    system.time({
        X2 <- lapply(1:100, function(i) do.call(rbind, X[((i*500)-499):(i*500)]))
        X3 <- do.call(rbind, X2)
    }) 
    #    user  system elapsed 
    #    6.14    0.05    6.21 
    
    
    ## Checking that the results are the same
    identical(X1, X3)
    # [1] TRUE
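
    The same two-stage idea can be wrapped in a small function so that it works for any list length, not just 50,000. This is only a sketch: `rbind_in_chunks` and its `chunk_size` argument are names made up for illustration, and 500 is simply the group size used above, not a tuned value.

    ```r
    # Hypothetical helper: rbind a list of data.frames in two stages.
    rbind_in_chunks <- function(dfs, chunk_size = 500) {
      # Stage 1: assign consecutive elements to groups of at most chunk_size
      # and rbind within each group.
      group_id <- ceiling(seq_along(dfs) / chunk_size)
      groups <- split(dfs, group_id)
      stage1 <- lapply(groups, function(g) do.call(rbind, g))
      # Stage 2: rbind the per-group results. split() names the groups
      # "1", "2", ...; unname() keeps those names out of the row names.
      do.call(rbind, unname(stage1))
    }

    X_small <- replicate(23, data.frame(a = rnorm(5), b = 1:5), simplify = FALSE)
    chunked <- rbind_in_chunks(X_small, chunk_size = 5)
    direct  <- do.call(rbind, X_small)
    all.equal(chunked, direct)
    ```

    The last group may be shorter than `chunk_size`, so the list length does not have to be a multiple of it.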
    
  • 2020-12-02 19:39

    Given that you are looking for performance, it appears that a data.table solution should be suggested.

    data.table provides a function, rbindlist, which does the same job as do.call(rbind, list) but is much faster:

    library(data.table)
    X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
    system.time(rbindlist.data.table <- rbindlist(X))
    ##  user  system elapsed 
    ##  0.00    0.01    0.02
    

    It is also very fast for a list of data.frames:

    Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
    
    system.time(rbindlist.data.frame <- rbindlist(Xdf))
    ##  user  system elapsed 
    ##  0.03    0.00    0.03
    

    For comparison

    system.time(docall <- do.call(rbind, Xdf))
    ##  user  system elapsed 
    ## 50.72    9.89   60.88 
    

    And some proper benchmarking

    library(rbenchmark)
    benchmark(rbindlist.data.table = rbindlist(X), 
               rbindlist.data.frame = rbindlist(Xdf),
               docall = do.call(rbind, Xdf),
               replications = 5)
    ##                   test replications elapsed    relative user.self sys.self 
    ## 3               docall            5  276.61 3073.444445    264.08     11.4 
    ## 2 rbindlist.data.frame            5    0.11    1.222222      0.11      0.0 
    ## 1 rbindlist.data.table            5    0.09    1.000000      0.09      0.0 
    

    and against @JoshuaUlrich's solutions

    benchmark(use.rbl.dt    = rbl.dt(X),
              use.rbl.ju    = rbl.ju(Xdf),
              use.rbindlist = rbindlist(X),
              replications = 5)
    
    ##              test replications elapsed relative user.self 
    ## 3  use.rbindlist            5    0.10      1.0      0.09
    ## 1     use.rbl.dt            5    0.10      1.0      0.09
    ## 2     use.rbl.ju            5    0.33      3.3      0.31 
    

    I'm not sure you really need to use as.data.frame, because a data.table inherits from class data.frame.
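
    A quick way to check the inheritance point, and the two conversion options if a plain data.frame is wanted after all. This is a minimal sketch that assumes data.table is installed in a version that provides `setDF`; the tiny input list is made up for illustration.

    ```r
    library(data.table)

    dt <- rbindlist(list(data.frame(a = 1:2, b = c("x", "y")),
                         data.frame(a = 3L,  b = "z")))

    class(dt)                       # includes both data.table and data.frame
    was_df <- is.data.frame(dt)     # so code that only needs a data.frame accepts it

    df <- as.data.frame(dt)         # makes a plain data.frame copy
    setDF(dt)                       # or converts dt itself, by reference
    ```

    Functions that index with `[` may behave differently on a data.table than on a data.frame, which is the usual reason to convert.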

  • 2020-12-02 19:43

    You have a list of data.frames that each have a single row. If it is possible to convert each of those to a vector, I think that would speed things up a lot.
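
    As a rough illustration of the vector idea, here is a sketch that assumes every data.frame has exactly one row and only numeric columns (mixed types would be coerced to a common type by `unlist`). The names `rows`, `vecs` and `m` are made up for the example.

    ```r
    # Three single-row, all-numeric data.frames standing in for the real list.
    rows <- list(data.frame(x = 1, y = 10),
                 data.frame(x = 2, y = 20),
                 data.frame(x = 3, y = 30))

    vecs <- lapply(rows, unlist)   # each one becomes a named numeric vector
    m    <- do.call(rbind, vecs)   # rbind on vectors builds a matrix cheaply
    res  <- as.data.frame(m)       # convert once at the end
    res
    ```

    Here rbind dispatches to the fast matrix method rather than rbind.data.frame, so the per-call checking is avoided.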

    However, assuming that they need to be data.frames, I'll create a function with code borrowed from Dominik's answer to "Can rbind be parallelized in R?"

    do.call.rbind <- function (lst) {
      while (length(lst) > 1) {
        idxlst <- seq(from = 1, to = length(lst), by = 2)
        lst <- lapply(idxlst, function(i) {
          if (i == length(lst)) {
            return(lst[[i]])
          }
          return(rbind(lst[[i]], lst[[i + 1]]))
        })
      }
      lst[[1]]
    }
    

    I have been using this function for several months, and have found it to be faster and use less memory than do.call(rbind, ...) [the disclaimer is that I've pretty much only used it on xts objects]

    The more rows that each data.frame has, and the more elements that the list has, the more beneficial this function will be.

    If you have a list of 100,000 numeric vectors, do.call(rbind, ...) will be better. If you have a list of length one billion, this will be better.

    > df <- lapply(1:10000, function(x) data.frame(x = sample(21, 21)))
    > library(rbenchmark)
    > benchmark(a=do.call(rbind, df), b=do.call.rbind(df))
      test replications elapsed relative user.self sys.self user.child sys.child
    1    a          100 327.728 1.755965   248.620   79.099          0         0
    2    b          100 186.637 1.000000   181.874    4.751          0         0
    

    The relative speedup grows as you increase the length of the list.
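
    One way to see why longer lists benefit: each pass of the pairwise loop roughly halves the list, so the number of passes grows only with the logarithm of the list length. The sketch below is an instrumented copy of the same idea that also counts passes; `pairwise_rbind` and its `rounds` counter are hypothetical names added for illustration.

    ```r
    # Hypothetical instrumented variant of the pairwise approach.
    pairwise_rbind <- function(lst) {
      rounds <- 0L
      while (length(lst) > 1) {
        idx <- seq(from = 1, to = length(lst), by = 2)
        lst <- lapply(idx, function(i) {
          # An unpaired last element is carried over unchanged.
          if (i == length(lst)) lst[[i]] else rbind(lst[[i]], lst[[i + 1]])
        })
        rounds <- rounds + 1L
      }
      list(result = lst[[1]], rounds = rounds)
    }

    dfs <- lapply(1:10, function(i) data.frame(id = i, sq = i^2))
    out <- pairwise_rbind(dfs)
    out$rounds   # list length goes 10 -> 5 -> 3 -> 2 -> 1, i.e. 4 passes
    all.equal(out$result, do.call(rbind, dfs))
    ```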
