What is the most efficient way to cast a list as a data frame?

悲&欢浪女 2020-11-29 16:55

Very often I want to convert a list to a data frame, where every element of the list holds the same fields with identical types. For example, I may have a list:

> my.list

  • 2020-11-29 17:18

    I can't tell you this is the "most efficient" in terms of memory or speed, but it's pretty efficient in terms of coding:

    my.df <- do.call("rbind", lapply(my.list, data.frame))

    The lapply() step with data.frame() turns each list item into a single-row data frame, which then plays nicely with rbind().
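
    As a minimal sketch of the one-liner above, assuming a small hypothetical list `records` (not the asker's data) whose elements all share the same field names:

    ```r
    # Hypothetical input: each element is one record with identical fields.
    records <- list(
      list(tok = "hello", range = 0.03, freq = 10),
      list(tok = "world", range = 0.09, freq = 20)
    )

    # lapply() turns every record into a one-row data frame;
    # do.call() then passes all of them to a single rbind() call.
    rows <- lapply(records, data.frame)
    records.df <- do.call("rbind", rows)

    print(dim(records.df))    # number of rows and columns
    print(names(records.df))  # column names come from the field names
    ```

    Each record costs one data.frame() call here, which is convenient to write but is likely the slow part on long lists.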

  • 2020-11-29 17:24

    Although this question has long since been answered, it's worth pointing out the data.table package has rbindlist which accomplishes this task very quickly:

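    library(data.table); library(microbenchmark)
    f <- function(x) function(i) sapply(x, `[[`, i)  # column extractor, as defined in the timings answer
    l <- replicate(1E4, list(a=runif(1), b=runif(1), c=runif(1)), simplify=FALSE)
    microbenchmark( times=5,
      R=as.data.frame(Map(f(l), names(l[[1]]))),
      dt=rbindlist(l))  # reconstructed: the original dt expression was cut off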

    gives me

    Unit: milliseconds
     expr       min        lq    median        uq       max neval
        R 31.060119 31.403943 32.278537 32.370004 33.932700     5
       dt  2.271059  2.273157  2.600976  2.635001  2.729421     5
  • 2020-11-29 17:24

    Not sure where they rank as far as efficiency, but depending on the structure of your lists there are some tidyverse options. A bonus is that they work nicely with unequal length lists:

    l <- list(a = list(var.1 = 1, var.2 = 2, var.3 = 3)
            , b = list(var.1 = 4, var.2 = 5)
            , c = list(var.1 = 7, var.3 = 9)
            , d = list(var.1 = 10, var.2 = 11, var.3 = NA))
    df <- dplyr::bind_rows(l)
    df <- purrr::map_df(l, dplyr::bind_rows)
    df <- purrr::map_df(l, ~.x)
    # all create the same data frame:
    # A tibble: 4 x 3
      var.1 var.2 var.3
      <dbl> <dbl> <dbl>
    1     1     2     3
    2     4     5    NA
    3     7    NA     9
    4    10    11    NA

    And you can also mix vectors and data frames:

    dplyr::bind_rows(
      list(a = 1, b = 2),
      tibble::tibble(a = 3:4, b = 5:6),  # data_frame() is deprecated in favor of tibble()
      c(a = 7)
    )
    # A tibble: 4 x 2
          a     b
      <dbl> <dbl>
    1     1     2
    2     3     5
    3     4     6
    4     7    NA
  • 2020-11-29 17:34

    The dplyr package's bind_rows() is also efficient. For example, binding two four-row slices of mtcars:

    one <- mtcars[1:4, ]
    two <- mtcars[11:14, ]
    system.time(dplyr::bind_rows(one, two))
       user  system elapsed 
      0.001   0.000   0.001 
  • 2020-11-29 17:38

    I think you want:

    > do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE))
      global_stdev_ppb      range   tok global_freq_ppb
    1         24267673 0.03114799 hello        211592.6
    2         11561448 0.08870838 world       1002043.0
    > str(do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE)))
    'data.frame':   2 obs. of  4 variables:
     $ global_stdev_ppb: num  24267673 11561448
     $ range           : num  0.0311 0.0887
     $ tok             : chr  "hello" "world"
     $ global_freq_ppb : num  211593 1002043
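
    To see what the stringsAsFactors=FALSE argument changes, here is a small sketch on a hypothetical two-record list. The values are copied as printed in the output above, and the object name `tokens` is made up. Since R 4.0.0 the default for data.frame() is already stringsAsFactors = FALSE, so the explicit argument mainly matters on older versions:

    ```r
    tokens <- list(
      list(global_stdev_ppb = 24267673, range = 0.03114799, tok = "hello"),
      list(global_stdev_ppb = 11561448, range = 0.08870838, tok = "world")
    )

    # Extra arguments to lapply() are forwarded to data.frame(), so every
    # one-row data frame keeps `tok` as character rather than factor.
    tokens.df <- do.call(rbind, lapply(tokens, data.frame, stringsAsFactors = FALSE))

    # Inspect the class of each column.
    col.classes <- sapply(tokens.df, class)
    print(col.classes)
    ```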
  • 2020-11-29 17:38

    Another option is:

    data.frame(t(sapply(mylist, `[`)))

    but this simple manipulation results in a data frame of lists:

    > str(data.frame(t(sapply(mylist, `[`))))
    'data.frame':   2 obs. of  3 variables:
     $ a:List of 2
      ..$ : num 1
      ..$ : num 2
     $ b:List of 2
      ..$ : num 2
      ..$ : num 3
     $ c:List of 2
      ..$ : chr "a"
      ..$ : chr "b"

    An alternative along the same lines, which gives the same result as the other solutions (a data frame of atomic vectors), is:

    data.frame(lapply(data.frame(t(sapply(mylist, `[`))), unlist))
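
    The difference between the two variants is easy to check. In this sketch (the names `wide`, `as.lists`, and `as.vectors` are only illustrative), sapply() cannot simplify the mixed numeric and character fields to an atomic matrix, so it returns a matrix of mode list; unlist() then flattens each column back to an atomic vector:

    ```r
    mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

    # sapply() yields a 3 x 2 matrix of mode "list"; t() makes it one row per record.
    wide <- t(sapply(mylist, `[`))

    as.lists   <- data.frame(wide)                      # every column is a list
    as.vectors <- data.frame(lapply(as.lists, unlist))  # every column is atomic

    print(sapply(as.lists, is.list))    # TRUE for each column
    print(sapply(as.vectors, is.list))  # FALSE for each column
    ```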

    [Edit: included timings of @Martin Morgan's two solutions, which have the edge over the other solutions that return a data frame of vectors.] Some representative timings on a very simple problem:

    > mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))
    > ## @Joshua Ulrich's solution:
    > system.time(replicate(1000, do.call(rbind, lapply(mylist, data.frame,
    +                                     stringsAsFactors=FALSE))))
       user  system elapsed 
      1.740   0.001   1.750
    > ## @JD Long's solution:
    > system.time(replicate(1000, do.call(rbind, lapply(mylist, data.frame))))
       user  system elapsed 
      2.308   0.002   2.339
    > ## my sapply solution No.1:
    > system.time(replicate(1000, data.frame(t(sapply(mylist, `[`)))))
       user  system elapsed 
      0.296   0.000   0.301
    > ## my sapply solution No.2:
    > system.time(replicate(1000, data.frame(lapply(data.frame(t(sapply(mylist, `[`))), 
    +                                               unlist))))
       user  system elapsed 
      1.067   0.001   1.091
    > ## @Martin Morgan's Map() sapply() solution:
    > f = function(x) function(i) sapply(x, `[[`, i)
    > system.time(replicate(1000, as.data.frame(Map(f(mylist), names(mylist[[1]])))))
       user  system elapsed 
      0.775   0.000   0.778
    > ## @Martin Morgan's Map() lapply() unlist() solution:
    > f = function(x) function(i) unlist(lapply(x, `[[`, i), use.names=FALSE)
    > system.time(replicate(1000, as.data.frame(Map(f(mylist), names(mylist[[1]])))))
       user  system elapsed 
      0.653   0.000   0.658
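
    One plausible reading of why the two Map() variants come out ahead of the rbind() approaches is that they build the data frame column by column instead of row by row. A minimal sketch of the idea, restating the `f` closure from the transcript above with comments (the names `cols` and `result` are only illustrative):

    ```r
    mylist <- list(list(a = 1, b = 2, c = "a"), list(a = 2, b = 3, c = "b"))

    # f(x) returns a function that pulls field i out of every record in x
    # and flattens the result into one atomic vector (one column).
    f <- function(x) function(i) unlist(lapply(x, `[[`, i), use.names = FALSE)

    # Map() applies that extractor once per field name, giving a named
    # list of columns that as.data.frame() can assemble in a single step.
    cols <- Map(f(mylist), names(mylist[[1]]))
    result <- as.data.frame(cols, stringsAsFactors = FALSE)

    print(result)
    ```

    This relies on every record having the same field names as the first one; records with missing fields would need a different approach.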