Combine a list of data frames into one data frame

半阙折子戏 2020-11-21 11:25

I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame.

I got some pointers from an earlier ques

9 answers
  • 2020-11-21 11:53

    Code:

    library(microbenchmark)
    
    dflist <- vector(length=100,mode="list")
    for(i in 1:100)
    {
      dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                                c=rep(LETTERS,10),d=rep(LETTERS,10))
    }
    
    
    mb <- microbenchmark(
    plyr::rbind.fill(dflist),
    dplyr::bind_rows(dflist),
    data.table::rbindlist(dflist),
    plyr::ldply(dflist,data.frame),
    do.call("rbind",dflist),
    times=1000)
    
    ggplot2::autoplot(mb)
    

    Session:

    R version 3.3.0 (2016-05-03)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1
    
    > packageVersion("plyr")
    [1] ‘1.8.4’
    > packageVersion("dplyr")
    [1] ‘0.5.0’
    > packageVersion("data.table")
    [1] ‘1.9.6’
    

    UPDATE: Rerun 31-Jan-2018. Ran on the same computer. New versions of packages. Added seed for seed lovers.

    set.seed(21)
    library(microbenchmark)
    
    dflist <- vector(length=100,mode="list")
    for(i in 1:100)
    {
      dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                                c=rep(LETTERS,10),d=rep(LETTERS,10))
    }
    
    
    mb <- microbenchmark(
      plyr::rbind.fill(dflist),
      dplyr::bind_rows(dflist),
      data.table::rbindlist(dflist),
      plyr::ldply(dflist,data.frame),
      do.call("rbind",dflist),
      times=1000)
    
    ggplot2::autoplot(mb)+ggplot2::theme_bw()
    
    
    R version 3.4.0 (2017-04-21)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1
    
    > packageVersion("plyr")
    [1] ‘1.8.4’
    > packageVersion("dplyr")
    [1] ‘0.7.2’
    > packageVersion("data.table")
    [1] ‘1.10.4’
    

    UPDATE: Rerun 06-Aug-2019.

    set.seed(21)
    library(microbenchmark)
    
    dflist <- vector(length=100,mode="list")
    for(i in 1:100)
    {
      dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                                c=rep(LETTERS,10),d=rep(LETTERS,10))
    }
    
    
    mb <- microbenchmark(
      plyr::rbind.fill(dflist),
      dplyr::bind_rows(dflist),
      data.table::rbindlist(dflist),
      plyr::ldply(dflist,data.frame),
      do.call("rbind",dflist),
      purrr::map_df(dflist,dplyr::bind_rows),
      times=1000)
    
    ggplot2::autoplot(mb)+ggplot2::theme_bw()
    
    R version 3.6.0 (2019-04-26)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 18.04.2 LTS
    
    Matrix products: default
    BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
    LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
    
    > packageVersion("plyr")
    [1] ‘1.8.4’
    > packageVersion("dplyr")
    [1] ‘0.8.3’
    > packageVersion("data.table")
    [1] ‘1.12.2’
    > packageVersion("purrr")
    [1] ‘0.3.2’
    
  • 2020-11-21 11:55

    Use bind_rows() from the dplyr package:

    bind_rows(list_of_dataframes, .id = "column_label")
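
    As a rough sketch of what the .id argument does: when the list is named, the new column holds the name of the list element each row came from. The list contents and the names first and second below are made up for illustration.

    ```r
    library(dplyr)

    # Hypothetical named list; names and values are invented for this example.
    list_of_dataframes <- list(
      first  = data.frame(x = 1:2, y = c("a", "b")),
      second = data.frame(x = 3:4, y = c("c", "d"))
    )

    # .id = "column_label" adds a column recording the source list element.
    combined <- bind_rows(list_of_dataframes, .id = "column_label")
    combined
    ```

    If the list has no names, bind_rows() fills the column with the element's position instead.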
    
  • 2020-11-21 11:58

    Here's another way this can be done. I'm adding it to the answers because reduce is a very effective functional tool that is often overlooked as a replacement for loops. In this particular case, neither of these is significantly faster than do.call.

    using base R:

    df <- Reduce(rbind, listOfDataFrames)
    

    or, using the tidyverse:

    library(tidyverse) # or, library(dplyr); library(purrr)
    df <- listOfDataFrames %>% reduce(bind_rows)
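
    To make the difference from do.call concrete, here is a small base R sketch. Reduce folds the list from the left, binding two data frames at a time, while do.call hands every element to a single rbind call. The three-element list is made up for illustration.

    ```r
    # Hypothetical three-element list, invented for this example.
    listOfDataFrames <- list(
      data.frame(x = 1:2, y = c("a", "b")),
      data.frame(x = 3:4, y = c("c", "d")),
      data.frame(x = 5:6, y = c("e", "f"))
    )

    # Reduce folds from the left: rbind(rbind(df1, df2), df3)
    via_reduce <- Reduce(rbind, listOfDataFrames)

    # do.call passes all elements to one call: rbind(df1, df2, df3)
    via_docall <- do.call(rbind, listOfDataFrames)

    # Compare the two results
    all.equal(via_reduce, via_docall)
    ```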
    
  • 2020-11-21 12:03

    There is also bind_rows(x, ...) in dplyr.

    > system.time({ df.Base <- do.call("rbind", listOfDataFrames) })
       user  system elapsed 
       0.08    0.00    0.07 
    > 
    > system.time({ df.dplyr <- as.data.frame(bind_rows(listOfDataFrames)) })
       user  system elapsed 
       0.01    0.00    0.02 
    > 
    > identical(df.Base, df.dplyr)
    [1] TRUE
    
  • 2020-11-21 12:05

    One other option is to use a plyr function:

    df <- ldply(listOfDataFrames, data.frame)
    

    This is a little slower than the original:

    > system.time({ df <- do.call("rbind", listOfDataFrames) })
       user  system elapsed 
       0.25    0.00    0.25 
    > system.time({ df2 <- ldply(listOfDataFrames, data.frame) })
       user  system elapsed 
       0.30    0.00    0.29
    > identical(df, df2)
    [1] TRUE
    

    My guess is that do.call("rbind", ...) will be the fastest approach you will find, unless you can (a) use matrices instead of data.frames and (b) preallocate the final matrix and assign into it rather than growing it.
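
    A minimal sketch of the preallocation idea in (a) and (b), assuming every piece is a numeric matrix with the same number of columns. The input list mats is made up for illustration, and this is not benchmarked here.

    ```r
    # Hypothetical input: five 3 x 2 numeric matrices.
    set.seed(1)
    mats <- lapply(1:5, function(i) matrix(runif(6), nrow = 3, ncol = 2))

    # Preallocate the final matrix once, sized for all rows.
    rows_each <- vapply(mats, nrow, integer(1))
    out <- matrix(NA_real_, nrow = sum(rows_each), ncol = ncol(mats[[1]]))

    # Copy each piece into its own row range instead of growing the result.
    end   <- cumsum(rows_each)
    start <- end - rows_each + 1L
    for (i in seq_along(mats)) {
      out[start[i]:end[i], ] <- mats[[i]]
    }
    ```

    The result holds the same values as do.call(rbind, mats); the difference is only in how the storage is allocated.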

    Edit 1:

    Based on Hadley's comment, here's the latest version of rbind.fill from CRAN:

    > system.time({ df3 <- rbind.fill(listOfDataFrames) })
       user  system elapsed 
       0.24    0.00    0.23 
    > identical(df, df3)
    [1] TRUE
    

    This is easier than rbind, and marginally faster (these timings hold up over multiple runs). And as far as I understand it, the version of plyr on GitHub is even faster than this.

  • 2020-11-21 12:07

    The only thing missing from the data.table solutions is an identifier column that records which data frame in the list each row came from.

    Something like this:

    df_id <- data.table::rbindlist(listOfDataFrames, idcol = TRUE)
    

    The idcol parameter adds a column (.id) identifying which data frame in the list each row came from. The result would look something like this:

    .id  a            b            c
      1  u  -0.05315128  -1.31975849
      1  b  -1.00404849   1.15257952
      1  y   1.17478229  -0.91043925
      1  q  -1.65488899   0.05846295
      1  c  -1.43730524   0.95245909
      1  b   0.56434313   0.93813197
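
    As a small variation, idcol also accepts a string, which sets the name of the identifier column, and with a named list the column holds the list names rather than positions. The list contents and the column name source below are made up for illustration.

    ```r
    library(data.table)

    # Hypothetical named list; names and values are invented for this example.
    listOfDataFrames <- list(
      first  = data.frame(a = c("u", "b"), b = c(0.1, 0.2)),
      second = data.frame(a = c("y", "q"), b = c(0.3, 0.4))
    )

    # idcol = TRUE would name the column ".id"; a string chooses the name.
    df_id <- rbindlist(listOfDataFrames, idcol = "source")
    df_id
    ```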
    