How to flatten an R data frame that contains lists?

梦谈多话 2020-12-17 15:08

I want to find the best "R way" to flatten a data frame that looks like this:

  CAT    COUNT     TREAT
   A     1,2,3     Treat-a, Treat-b
   B     4,5       Treat-c, Treat-d, Treat-e
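
For anyone who wants to reproduce this, a data frame of this shape could be built roughly as follows. This is only a sketch: it assumes COUNT and TREAT are list columns, and the values for row B are taken from the outputs shown in the answers below.

```r
# Assumed reconstruction of the example data (not necessarily the exact object)
df <- data.frame(CAT = c("A", "B"), stringsAsFactors = FALSE)

# Each list column holds one vector per row, and the vectors differ in length
df$COUNT <- list(1:3, 4:5)
df$TREAT <- list(c("Treat-a", "Treat-b"),
                 c("Treat-c", "Treat-d", "Treat-e"))

str(df)
```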


        
3 Answers
  •  醉梦人生
    2020-12-17 15:49

    There's a deleted answer here that indicates that "splitstackshape" could be used for this. It can, but the deleted answer used the wrong function. Instead, it should use the listCol_w function. Unfortunately, in its present form, this function is not vectorized across columns, so you would need to nest the calls to listCol_w for each column that needs to be flattened.

    Here's the approach:

    library(splitstackshape)
    listCol_w(listCol_w(df, "COUNT", fill = NA), "TREAT", fill = NA)
    ##    CAT COUNT_fl_1 COUNT_fl_2 COUNT_fl_3 TREAT_fl_1 TREAT_fl_2 TREAT_fl_3
    ## 1:   A          1          2          3    Treat-a    Treat-b         NA
    ## 2:   B          4          5         NA    Treat-c    Treat-d    Treat-e
    

    Note that fill = NA has been specified because it defaults to fill = NA_character_, which would otherwise coerce all the values to character.
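
    The coercion in question is ordinary base R behaviour rather than anything specific to "splitstackshape". A minimal illustration (it does not show the internals of listCol_w, only the general rule for combining values with different kinds of NA):

    ```r
    num <- c(1, 2, 3)

    # A plain (logical) NA adapts to the type of the other values
    with_logical_na <- c(num, NA)

    # A character NA forces the whole vector to character
    with_char_na <- c(num, NA_character_)

    class(with_logical_na)
    class(with_char_na)
    ```

    The first vector stays numeric and the second becomes character, which is why the fill value matters for the COUNT columns.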


    Another alternative would be to use transpose from "data.table". Here's a possible implementation (looks scary, but using the function is easy). Benefits are that (1) you can specify the columns to flatten, (2) you can decide whether you want to drop the original column or not, and (3) it's fast.

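    flatten <- function(indt, cols, drop = FALSE) {
      library(data.table)
      # Note: if indt is already a data.table, := below modifies it by reference
      if (!is.data.table(indt)) indt <- as.data.table(indt)
      # Longest list element in each column = number of new columns needed
      x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
      # New column names, e.g. COUNT_1, COUNT_2, ..., TREAT_1, ...
      nams <- paste(rep(cols, x), sequence(x), sep = "_")
      # transpose() turns each list column into one vector per new column
      indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = cols]
      # Optionally remove the original list columns
      if (isTRUE(drop)) indt[, (cols) := NULL]
      indt[]
    }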
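
    To see what the function relies on, here is a small sketch of transpose applied to a single list, together with the naming step. The objects counts, cols_out, width and new_names are made up for illustration and are not part of the function.

    ```r
    library(data.table)

    # One list element per row, as in the COUNT column
    counts <- list(1:3, 4:5)

    # transpose() returns one element per output column;
    # shorter rows are padded with NA (the default fill)
    cols_out <- transpose(counts)

    # Names are built the same way as `nams` in flatten()
    width <- max(lengths(counts))
    new_names <- paste(rep("COUNT", width), sequence(width), sep = "_")

    names(cols_out) <- new_names
    str(cols_out)
    ```

    Each element of the result holds the values for one new column, so assigning the list with := creates all of the new columns in a single step.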
    

    Usage would be...

    Keeping original columns:

    flatten(df, c("COUNT", "TREAT"))
    #    CAT COUNT                   TREAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
    # 1:   A 1,2,3         Treat-a,Treat-b       1       2       3 Treat-a Treat-b      NA
    # 2:   B   4,5 Treat-c,Treat-d,Treat-e       4       5      NA Treat-c Treat-d Treat-e
    

    Dropping original columns:

    flatten(df, c("COUNT", "TREAT"), TRUE)
    #    CAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
    # 1:   A       1       2       3 Treat-a Treat-b      NA
    # 2:   B       4       5      NA Treat-c Treat-d Treat-e
    

    See this gist for a comparison with the other solutions proposed.
