I want to find the best \"R way\" to flatten a dataframe that looks like this:
CAT COUNT TREAT
A 1,2,3 Treat-a, Treat-b
B 4,5
Here is a solution using base R, accepting vectors of any length inside your list and no need to specify which columns of the dataframe you want to collapse. Part of the solution was generated using this answer.
df2 <- do.call(cbind,lapply(df,function(x){
#check if it is a list, otherwise just return as is
if(is.list(x)){
return(data.frame(t(sapply(x,'[',seq(max(sapply(x,length)))))))
} else{
return(x)
}
}))
As of R 3.2 there is lengths
to replace sapply(x, length)
as well,
df3 <- do.call(cbind.data.frame, lapply(df, function(x) {
# check if it is a list, otherwise just return as is
if (is.list(x)) {
data.frame(t(sapply(x,'[', seq(max(lengths(x))))))
} else {
x
}
}))
data used:
df <- structure(list(CAT = structure(1:2, .Label = c("A", "B"), class = "factor"),
COUNT = list(1:3, 4:5), TREAT = list(c("Treat-a", "Treat-b"
), c("Treat-c", "Treat-d", "Treat-e"))), .Names = c("CAT",
"COUNT", "TREAT"), row.names = c(NA, -2L), class = "data.frame")
There's a deleted answer here that indicates that "splitstackshape" could be used for this. It can, but the deleted answer used the wrong function. Instead, it should use the listCol_w
function. Unfortunately, in its present form, this function is not vectorized across columns, so you would need to nest the calls to listCol_w
for each column that needs to be flattened.
Here's the approach:
library(splitstackshape)
listCol_w(listCol_w(df, "COUNT", fill = NA), "TREAT", fill = NA)
## CAT COUNT_fl_1 COUNT_fl_2 COUNT_fl_3 TREAT_fl_1 TREAT_fl_2 TREAT_fl_3
## 1: A 1 2 3 Treat-a Treat-b NA
## 2: B 4 5 NA Treat-c Treat-d Treat-e
Note that fill = NA
has been specified because it defaults to fill = NA_character_
, which would otherwise coerce all the values to character.
Another alternative would be to use transpose
from "data.table". Here's a possible implementation (looks scary, but using the function is easy). Benefits are that (1) you can specify the columns to flatten, (2) you can decide whether you want to drop the original column or not, and (3) it's fast.
flatten <- function(indt, cols, drop = FALSE) {
require(data.table)
if (!is.data.table(indt)) indt <- as.data.table(indt)
x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
nams <- paste(rep(cols, x), sequence(x), sep = "_")
indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = cols]
if (isTRUE(drop)) {
indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE),
.SDcols = cols][, (cols) := NULL]
}
indt[]
}
Usage would be...
Keeping original columns:
flatten(df, c("COUNT", "TREAT"))
# CAT COUNT TREAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
# 1: A 1,2,3 Treat-a,Treat-b 1 2 3 Treat-a Treat-b NA
# 2: B 4,5 Treat-c,Treat-d,Treat-e 4 5 NA Treat-c Treat-d Treat-e
Dropping original columns:
flatten(df, c("COUNT", "TREAT"), TRUE)
# CAT COUNT_1 COUNT_2 COUNT_3 TREAT_1 TREAT_2 TREAT_3
# 1: A 1 2 3 Treat-a Treat-b NA
# 2: B 4 5 NA Treat-c Treat-d Treat-e
See this gist for a comparison with the other solutions proposed.
Here is another way in base r
df<-data.frame(CAT=c("A","B"))
df$COUNT <-list(1:3,4:5)
df$TREAT <-list(paste("Treat-", letters[1:2],sep=""),paste("Treat-", letters[3:5],sep=""))
Create a helper function to do the work
f <- function(l) {
if (!is.list(l)) return(l)
do.call('rbind', lapply(l, function(x) `length<-`(x, max(lengths(l)))))
}
Always test your code
f(df$TREAT)
# [,1] [,2] [,3]
# [1,] "Treat-a" "Treat-b" NA
# [2,] "Treat-c" "Treat-d" "Treat-e"
Apply it
df[] <- lapply(df, f)
df
# CAT COUNT.1 COUNT.2 COUNT.3 TREAT.1 TREAT.2 TREAT.3
# 1 A 1 2 3 Treat-a Treat-b <NA>
# 2 B 4 5 NA Treat-c Treat-d Treat-e