Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

前端未结

关注

 4  1469

I know there are many questions here in SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this questions is about understandi

相关标签:

4条回答

孤街浪徒

2020-12-02 19:29

rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.

# Use data from Josh O'Brien's post.
set.seed(21)
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time({
Names <- names(X[[1]])  # Get data.frame names from first list element.
# For each name, extract its values from each data.frame in the list.
# This provides a list with an element for each name.
Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
names(Xb) <- Names          # Give Xb the correct names.
Xb.df <- as.data.frame(Xb)  # Convert Xb to a data.frame.
})
#    user  system elapsed 
#   3.356   0.024   3.388 
system.time(X1 <- do.call(rbind, X))
#    user  system elapsed 
# 169.627   6.680 179.675
identical(X1,Xb.df)
# [1] TRUE

Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)

# My "rbind list" function
rbl.ju <- function(x) {
  u <- unlist(x, recursive=FALSE)
  n <- names(u)
  un <- unique(n)
  l <- lapply(un, function(N) unlist(u[N==n], FALSE, FALSE))
  names(l) <- un
  d <- as.data.frame(l)
}
# simple wrapper to rbindlist that returns a data.frame
rbl.dt <- function(x) {
  as.data.frame(rbindlist(x))
}

library(data.table)
if(packageVersion("data.table") >= '1.8.2') {
  system.time(dt <- rbl.dt(X))  # rbindlist only exists in recent versions
}
#    user  system elapsed 
#    0.02    0.00    0.02
system.time(ju <- rbl.ju(X))
#    user  system elapsed 
#    0.05    0.00    0.05 
identical(dt,ju)
# [1] TRUE

0 讨论(0)

轻奢々

2020-12-02 19:36

Your observation that the time taken increases exponentially with the number of data.frames suggests that breaking the rbinding into two stages could speed things up.

This simple experiment seems to confirm that that's a very fruitful path to take:

## Make a list of 50,000 data.frames
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)

## First, rbind together all 50,000 data.frames in a single step
system.time({
    X1 <- do.call(rbind, X)
})
#    user  system elapsed 
# 137.08   57.98  200.08 


## Doing it in two stages cuts the processing time by >95%
##   - In Stage 1, 100 groups of 500 data.frames are rbind'ed together
##   - In Stage 2, the resultant 100 data.frames are rbind'ed
system.time({
    X2 <- lapply(1:100, function(i) do.call(rbind, X[((i*500)-499):(i*500)]))
    X3 <- do.call(rbind, X2)
}) 
#    user  system elapsed 
#    6.14    0.05    6.21 


## Checking that the results are the same
identical(X1, X3)
# [1] TRUE

0 讨论(0)

北恋

2020-12-02 19:39

Given that you are looking for performance, it appears that a data.table solution should be suggested.

There is a function rbindlist which is the same but much faster than do.call(rbind, list)

library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
##  user  system elapsed 
##  0.00    0.01    0.02

It is also very fast for a list of data.frame

Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)

system.time(rbindlist.data.frame <- rbindlist(Xdf))
##  user  system elapsed 
##  0.03    0.00    0.03

For comparison

system.time(docall <- do.call(rbind, Xdf))
##  user  system elapsed 
## 50.72    9.89   60.88

And some proper benchmarking

library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X), 
           rbindlist.data.frame = rbindlist(Xdf),
           docall = do.call(rbind, Xdf),
           replications = 5)
##                   test replications elapsed    relative user.self sys.self 
## 3               docall            5  276.61 3073.444445    264.08     11.4 
## 2 rbindlist.data.frame            5    0.11    1.222222      0.11      0.0 
## 1 rbindlist.data.table            5    0.09    1.000000      0.09      0.0

and against @JoshuaUlrich's solutions

benchmark(use.rbl.dt  = rbl.dt(X), 
          use.rbl.ju  = rbl.ju (Xdf),
          use.rbindlist =rbindlist(X) ,
          replications = 5)

##              test replications elapsed relative user.self 
## 3  use.rbindlist            5    0.10      1.0      0.09
## 1     use.rbl.dt            5    0.10      1.0      0.09
## 2     use.rbl.ju            5    0.33      3.3      0.31

I'm not sure you really need to use as.data.frame, because a data.table inherits class data.frame

0 讨论(0)

醉话见心

2020-12-02 19:43
You have a list of data.frames that each have a single row. If it is possible to convert each of those to a vector, I think that would speed things up a lot.

However, assuming that they need to be data.frames, I'll create a function with code borrowed from Dominik's answer at Can rbind be parallelized in R?
```
do.call.rbind <- function (lst) {
  while (length(lst) > 1) {
    idxlst <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(idxlst, function(i) {
      if (i == length(lst)) {
        return(lst[[i]])
      }
      return(rbind(lst[[i]], lst[[i + 1]]))
    })
  }
  lst[[1]]
}
```
I have been using this function for several months, and have found it to be faster and use less memory than do.call(rbind, ...) [the disclaimer is that I've pretty much only used it on xts objects]

The more rows that each data.frame has, and the more elements that the list has, the more beneficial this function will be.

If you have a list of 100,000 numeric vectors, do.call(rbind, ...) will be better. If you have list of length one billion, this will be better.
```
> df <- lapply(1:10000, function(x) data.frame(x = sample(21, 21)))
> library(rbenchmark)
> benchmark(a=do.call(rbind, df), b=do.call.rbind(df))
test replications elapsed relative user.self sys.self user.child sys.child
1    a          100 327.728 1.755965   248.620   79.099          0         0
2    b          100 186.637 1.000000   181.874    4.751          0         0
```
The relative speed up will be exponentially better as you increase the length of the list.
0 讨论(0)
发布评论:

提交评论
- 加载中...