Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

Given that you are looking for performance, a data.table solution is worth suggesting.

There is a function rbindlist, which does the same job as do.call(rbind, list) but is much faster.

library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
##  user  system elapsed 
##  0.00    0.01    0.02

It is also very fast for a list of data.frames

Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)

system.time(rbindlist.data.frame <- rbindlist(Xdf))
##  user  system elapsed 
##  0.03    0.00    0.03

For comparison

system.time(docall <- do.call(rbind, Xdf))
##  user  system elapsed 
## 50.72    9.89   60.88 

And some proper benchmarking

library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X), 
           rbindlist.data.frame = rbindlist(Xdf),
           docall = do.call(rbind, Xdf),
           replications = 5)
##                   test replications elapsed    relative user.self sys.self 
## 3               docall            5  276.61 3073.444445    264.08     11.4 
## 2 rbindlist.data.frame            5    0.11    1.222222      0.11      0.0 
## 1 rbindlist.data.table            5    0.09    1.000000      0.09      0.0 

and against @JoshuaUlrich's solutions

benchmark(use.rbl.dt = rbl.dt(X),
          use.rbl.ju = rbl.ju(Xdf),
          use.rbindlist = rbindlist(X),
          replications = 5)

##              test replications elapsed relative user.self 
## 3  use.rbindlist            5    0.10      1.0      0.09
## 1     use.rbl.dt            5    0.10      1.0      0.09
## 2     use.rbl.ju            5    0.33      3.3      0.31 

I'm not sure you really need to use as.data.frame, because a data.table inherits from class data.frame.
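
For illustration, a minimal sketch (reusing the list X of data.tables defined above):

library(data.table)
DT <- rbindlist(X)           # the result is a data.table
class(DT)
## [1] "data.table" "data.frame"
inherits(DT, "data.frame")   # so most code that expects a data.frame will accept it
## [1] TRUE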

rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.

# Use data from Josh O'Brien's post.
set.seed(21)
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time({
Names <- names(X[[1]])  # Get data.frame names from first list element.
# For each name, extract its values from each data.frame in the list.
# This provides a list with an element for each name.
Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
names(Xb) <- Names          # Give Xb the correct names.
Xb.df <- as.data.frame(Xb)  # Convert Xb to a data.frame.
})
#    user  system elapsed 
#   3.356   0.024   3.388 
system.time(X1 <- do.call(rbind, X))
#    user  system elapsed 
# 169.627   6.680 179.675
identical(X1,Xb.df)
# [1] TRUE

Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)

# My "rbind list" function
rbl.ju <- function(x) {
  u <- unlist(x, recursive=FALSE)
  n <- names(u)
  un <- unique(n)
  l <- lapply(un, function(N) unlist(u[N==n], FALSE, FALSE))
  names(l) <- un
  d <- as.data.frame(l)
}
# simple wrapper to rbindlist that returns a data.frame
rbl.dt <- function(x) {
  as.data.frame(rbindlist(x))
}

library(data.table)
if(packageVersion("data.table") >= '1.8.2') {
  system.time(dt <- rbl.dt(X))  # rbindlist only exists in recent versions
}
#    user  system elapsed 
#    0.02    0.00    0.02
system.time(ju <- rbl.ju(X))
#    user  system elapsed 
#    0.05    0.00    0.05 
identical(dt,ju)
# [1] TRUE

Your observation that the time taken increases much faster than linearly with the number of data.frames suggests that breaking the rbinding into two stages could speed things up.

This simple experiment seems to confirm that that's a very fruitful path to take:

## Make a list of 50,000 data.frames
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)

## First, rbind together all 50,000 data.frames in a single step
system.time({
    X1 <- do.call(rbind, X)
})
#    user  system elapsed 
# 137.08   57.98  200.08 


## Doing it in two stages cuts the processing time by >95%
##   - In Stage 1, 100 groups of 500 data.frames are rbind'ed together
##   - In Stage 2, the resultant 100 data.frames are rbind'ed
system.time({
    X2 <- lapply(1:100, function(i) do.call(rbind, X[((i*500)-499):(i*500)]))
    X3 <- do.call(rbind, X2)
}) 
#    user  system elapsed 
#    6.14    0.05    6.21 


## Checking that the results are the same
identical(X1, X3)
# [1] TRUE
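
If the list length is not an exact multiple of the group size, the same two-stage idea can be written with split(). This is only a sketch; the group size of 500 is an arbitrary choice carried over from above:

two.stage.rbind <- function(lst, group.size = 500) {
  grp <- ceiling(seq_along(lst) / group.size)       # assign each element to a group
  pieces <- lapply(unname(split(lst, grp)),         # Stage 1: rbind within each group
                   function(g) do.call(rbind, g))
  do.call(rbind, pieces)                            # Stage 2: rbind the group results
}
## If row names end up differing from the single-step result, compare with
## all.equal(X1, two.stage.rbind(X), check.attributes = FALSE) rather than identical().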

You have a list of data.frames that each have a single row. If it is possible to convert each of those to a vector, I think that would speed things up a lot.
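
A hedged sketch of that idea (assuming every column is numeric, and using a hypothetical list Xrow of one-row data.frames for illustration): unlist each element to a named vector and rbind the vectors into a matrix.

## Sketch only: assumes each data.frame has one row and all-numeric columns.
Xrow <- replicate(50000, data.frame(a=rnorm(1), b=runif(1)), simplify=FALSE)
m  <- do.call(rbind, lapply(Xrow, unlist))  # rbind-ing plain vectors builds a matrix, not a data.frame
df <- as.data.frame(m)                      # convert back only if a data.frame is really needed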

However, assuming that they need to be data.frames, I'll create a function with code borrowed from Dominik's answer to "Can rbind be parallelized in R?"

do.call.rbind <- function(lst) {
  # Repeatedly rbind adjacent pairs, halving the list length on each pass,
  # until a single object remains (a pairwise / tournament reduction).
  while (length(lst) > 1) {
    idxlst <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(idxlst, function(i) {
      if (i == length(lst)) {
        return(lst[[i]])  # odd element out: carry it forward unchanged
      }
      return(rbind(lst[[i]], lst[[i + 1]]))
    })
  }
  lst[[1]]
}
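
A quick sanity check (a sketch; X here is the list of 50,000 five-row data.frames created earlier):

Xall <- do.call.rbind(X)
dim(Xall)
## [1] 250000      2
all.equal(Xall, do.call(rbind, X), check.attributes = FALSE)
## [1] TRUE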

I have been using this function for several months and have found it to be faster and to use less memory than do.call(rbind, ...). (The disclaimer is that I've pretty much only used it on xts objects.)

The more rows that each data.frame has, and the more elements that the list has, the more beneficial this function will be.

If you have a list of 100,000 numeric vectors, do.call(rbind, ...) will be better. If you have a list of length one billion, this function will be better.

> df <- lapply(1:10000, function(x) data.frame(x = sample(21, 21)))
> library(rbenchmark)
> benchmark(a=do.call(rbind, df), b=do.call.rbind(df))
test replications elapsed relative user.self sys.self user.child sys.child
1    a          100 327.728 1.755965   248.620   79.099          0         0
2    b          100 186.637 1.000000   181.874    4.751          0         0

The relative speed-up grows as the length of the list increases.
