Convert Mixed-Length named List to data.frame

夕颜 2020-12-31 04:52

I have a list of the following format:

[[1]]
[[1]]$a
[1] 1

[[1]]$b
[1] 3

[[1]]$c
[1] 5

[[2]]       
[[2]]$c
[1] 2

[[2]]$a
[1] 3

There i

6 Answers
  • 2020-12-31 05:04

    I know this is an old question, but I just came across it and it's excruciating not to see the simplest solution I'm aware of. So here it is (simply specify 'fill=TRUE' in rbindlist):

    library(data.table)
    # named x rather than list, so the base function list() is not masked
    x <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
    rbindlist(x, fill = TRUE)
    
    #    a  b c
    # 1: 1  3 5
    # 2: 3 NA 2
    

    I don't know if this is the fastest way, but I'd be willing to bet that it competes, given data.table's thoughtful design and extremely good performance on a lot of other tasks.

  • 2020-12-31 05:21

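    If you know the possible names beforehand and you are dealing with large data, pre-allocating a data.table and filling it with set() should be fast:

    library(data.table)

    # createList() is the test-data generator, presumably from the question (not shown here)
    cc <- createList(50000)
    ids <- c('a', 'b', 'c')

    system.time({
      # one NA-filled numeric column per known name
      nas <- rep.int(NA_real_, length(cc))
      DT <- setnames(as.data.table(replicate(length(ids), nas, simplify = FALSE)), ids)

      # fill in the cells that exist, by reference
      for (xx in seq_along(cc)) {
        for (j in names(cc[[xx]])) {
          set(DT, i = xx, j = j, value = cc[[xx]][[j]])
        }
      }
    })
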
    # user  system elapsed 
    # 0.68    0.01    0.70 
    

    Old (slow solution) for posterity

    full <- c('a', 'b', 'c')

    system.time({
      for (xx in seq_along(cc)) {
        # names missing from this element
        mm <- setdiff(full, names(cc[[xx]]))
        if (length(mm) || all(names(cc[[xx]]) == full)) {
          cc[[xx]] <- as.data.table(cc[[xx]])
          if (length(mm)) {
            # add the missing columns, filled with NA
            cc[[xx]][, (mm) := as.list(rep(NA_real_, length(mm)))]
          }
          # put columns in the correct order
          setcolorder(cc[[xx]], full)
        }
      }

      cdt <- rbindlist(cc)
    })
    
    #   user  system elapsed 
    # 21.83    0.06   22.00 
    

    This second solution has been left here to show how data.table can be used poorly.

  • 2020-12-31 05:22

    Here's my initial thought. It doesn't speed up your approach, but it does simplify the code considerably:

    # makeDF <- function(List, Names) {
    #     m <- t(sapply(List, function(X) unlist(X)[Names]))
    #     as.data.frame(m)
    # }

    ## vapply() is a bit faster than sapply()
    makeDF <- function(List, Names) {
        m <- t(vapply(List,
                      FUN = function(X) unlist(X)[Names],
                      FUN.VALUE = numeric(length(Names))))
        ## the names come from the first element, which may itself lack
        ## some of them, so set the column names explicitly
        colnames(m) <- Names
        as.data.frame(m)
    }
    
    ## Test timing with a 50k-item list
    ll <- createList(50000)
    nms <- c("a", "b", "c")
    
    system.time(makeDF(ll, nms))
    # user  system elapsed 
    # 0.47    0.00    0.47 
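
    As a quick check on a small input, here is a sketch that runs the same helper on the two-record list used in the other answers (the list `x` is that example, not the 50k-item benchmark data). The helper is restated so the block runs on its own, with the column names set explicitly so the labels do not depend on which fields the first record happens to have.

    ```r
    # Helper restated so this block is self-contained.
    makeDF <- function(List, Names) {
      m <- t(vapply(List,
                    FUN = function(X) unlist(X)[Names],
                    FUN.VALUE = numeric(length(Names))))
      colnames(m) <- Names   # label columns by the requested names
      as.data.frame(m)
    }

    # Small example: the second record lacks "b" and is out of order.
    x <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
    df <- makeDF(x, c("a", "b", "c"))
    print(df)
    #   a  b c
    # 1 1  3 5
    # 2 3 NA 2
    ```

    Names that are absent from a record come back as NA, because indexing a named vector by a missing name yields NA.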
    
  • 2020-12-31 05:22

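    In dplyr (using as_tibble(), which replaces the deprecated as_data_frame()):

    library(dplyr)
    # x is the list from the question
    bind_rows(lapply(x, as_tibble))

    # A tibble: 2 x 3
    #       a     b     c
    #   <dbl> <dbl> <dbl>
    # 1     1     3     5
    # 2     3    NA     2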
    
  • 2020-12-31 05:23

    Here is a short answer, though I doubt it will be very fast.

    > library(plyr)
    > rbind.fill(lapply(x, as.data.frame))
      a  b c
     1 1  3 5
     2 3 NA 2
    
  • 2020-12-31 05:23

    Well, I gave my first thought a try and the performance wasn't as bad as I feared, but I'm sure there's still room for improvement (especially in the wasteful matrix -> data.frame conversion).

    convertList <- function(myList, ids){
        # Column index of every value, which handles missing or improperly
        # ordered list elements: a maps to 1, b to 2, and c to 3, so a row
        # containing a=_, c=_, b=_ would have a value of `1,3,2`.
        idInd <- unlist(lapply(myList, function(x) match(names(x), ids)))

        # Row index of every value if myList were unlisted. With two elements
        # in the first row, three in the second, and one in the third,
        # you'd see: 1, 1, 2, 2, 2, 3
        rowInd <- rep(seq_along(myList), lengths(myList))

        # Create a grid of addresses: the first column is the row address,
        # the second is the column address.
        address <- cbind(rowInd, idInd)

        # have to use a matrix because you can't assign a data.frame
        # using an addressing table like we have above
        mat <- matrix(ncol=length(ids), nrow=length(myList))

        # assign the values to the addresses in the matrix
        mat[address] <- unlist(myList)

        # convert to data.frame
        df <- as.data.frame(mat)
        colnames(df) <- ids

        df
    }
    myList <- createList(50000)
    ids <- letters[1:3]
    
    system.time(df <- convertList(myList, ids))
    

    It's taking about 0.29 seconds to convert the 50,000 rows on my laptop (Windows 7, Intel i7 M620 @ 2.67 GHz, 4GB RAM).
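
    To make the addressing step concrete, here is a minimal sketch of the same idea on a hand-made two-record list (the small `myList` below is an illustrative input, not the benchmark data). It shows the row and column indices that get computed and how a two-column index matrix assigns one cell per row.

    ```r
    # Two records; the second lacks "b" and lists its fields out of order.
    myList <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
    ids <- c("a", "b", "c")

    # Column index of every value: 1 2 3 for the first record, 3 1 for the second.
    idInd <- unlist(lapply(myList, function(x) match(names(x), ids)))

    # Row index of every value: 1 1 1 2 2.
    rowInd <- rep(seq_along(myList), lengths(myList))

    # A two-column matrix used as an index addresses one (row, column) cell per row.
    mat <- matrix(NA_real_, nrow = length(myList), ncol = length(ids),
                  dimnames = list(NULL, ids))
    mat[cbind(rowInd, idInd)] <- unlist(myList)

    print(mat)
    #      a  b c
    # [1,] 1  3 5
    # [2,] 3 NA 2
    ```

    Cells that never receive an address keep their initial NA, which is how missing fields end up as NA in the result.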

    Still very much interested in other answers!
