I want to convert a dataframe from wide format to long format.
Here is a toy example:
mydata <- data.frame(ID=1:5, ZA_1=1:5,
ZA_2=
The OP has updated his answer to his own question, complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars. He wrote that data.table needs a direct way to do it without creating giant middle steps.
Well, data.table already has that ability: it's called a join.
Given the sample data from the question, the whole operation can be implemented in a less memory-consuming way by reshaping with only one id.var and later joining the reshaped result with the original data.table:
setDT(mydata)
# add unique row number to join on later
# (leave `ID` col as placeholder for all other id.vars)
mydata[, rn := seq_len(.N)]
# define columns to be reshaped
measure_cols <- stringr::str_subset(names(mydata), "_\\d$")
# melt with only one id.vars column
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)
# split column names of measure.vars
# Note that "variable" is reused to save memory
molten[, c("variable", "measure") := tstrsplit(variable, "_")]
# coerce names to factors in the same order as the columns appeared in mydata
molten[, variable := forcats::fct_inorder(variable)]
# remove columns no longer needed in mydata _before_ joining to save memory
mydata[, (measure_cols) := NULL]
# final dcast and right join
result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
result
# ID rn measure ZA BB CC
# 1: 1 1 1 1 3 NA
# 2: 1 1 2 5 6 NA
# 3: 1 1 7 NA NA 6
# 4: 2 2 1 2 3 NA
# 5: 2 2 2 4 6 NA
# 6: 2 2 7 NA NA 5
# 7: 3 3 1 3 3 NA
# 8: 3 3 2 3 6 NA
# 9: 3 3 7 NA NA 4
#10: 4 4 1 4 3 NA
#11: 4 4 2 2 6 NA
#12: 4 4 7 NA NA 3
#13: 5 5 1 5 3 NA
#14: 5 5 2 1 6 NA
#15: 5 5 7 NA NA 2
Finally, you may remove the row number, if it is no longer needed, with result[, rn := NULL].
Furthermore, you can remove the intermediate molten with rm(molten).
We started with a data.table consisting of 1 id column, 5 measure cols and 5 rows. The reshaped result has 1 id column, 3 measure cols, and 15 rows. So, the data volume stored in id columns has effectively tripled. However, the intermediate step needed only one id.var, rn.
EDIT: If memory consumption is crucial, it might be worthwhile to keep the id.vars and the measure.vars in two separate data.tables and to join only the necessary id.var columns with the measure.vars on demand.
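As a rough illustration of the two-table idea, here is a minimal sketch. It assumes the toy data from the question, uses integer values throughout so that melt() does not need to coerce types, and adds a hypothetical extra id column `group` that is not part of the original example. The table names `ids` and `measures` are illustrative only.

```r
library(data.table)

# Toy data; `group` is a hypothetical extra id column added for illustration
mydata <- data.table(ID = 1:5, group = letters[1:5],
                     ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3L, 5), BB_2 = rep(6L, 5))
mydata[, rn := seq_len(.N)]

measure_cols <- grep("_\\d$", names(mydata), value = TRUE)
id_cols <- setdiff(names(mydata), c(measure_cols, "rn"))

# Table 1: the id columns, one row per original row
ids <- mydata[, c("rn", id_cols), with = FALSE]

# Table 2: the measures in long format, carrying only the row number
long <- melt(mydata[, c("rn", measure_cols), with = FALSE], id.vars = "rn")
long[, c("variable", "measure") := tstrsplit(variable, "_")]
measures <- dcast(long, rn + measure ~ variable, value.var = "value")

# On demand: join only the id column that is actually needed
result <- ids[, .(rn, ID)][measures, on = "rn"]
```

Here `group` is never replicated, because it stays in `ids` until a query actually asks for it.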
Note that the measure.vars parameter to melt() accepts the special function patterns(). With this, the call to melt() could also have been written as
molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\\d$"))
Here is a method using the base R functions split.default and do.call.
# split the non-ID variables into groups based on their name suffix
myList <- split.default(mydata[-1], gsub(".*_(\\d)$", "\\1", names(mydata[-1])))
# append variables by row after setting the regularizing variable names, cbind ID
cbind(mydata[1],
do.call(rbind, lapply(myList, function(x) setNames(x, gsub("_\\d$", "", names(x))))))
ID ZA BB
1.1 1 1 3
1.2 2 2 3
1.3 3 3 3
1.4 4 4 3
1.5 5 5 3
2.1 1 5 6
2.2 2 4 6
2.3 3 3 6
2.4 4 2 6
2.5 5 1 6
The first line splits the data.frame variables (minus ID) into a list of groups that share the final character of their variable name. This criterion is determined using gsub. The second line uses do.call to call rbind on this list of groups, each modified with setNames so that the underscore and final digit are removed from the names. Finally, cbind attaches the ID to the resulting data.frame.
Note that the data has to be structured regularly, with no missing variables, etc.
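One way to test that requirement before reshaping is sketched below in base R. The helper name `is_regular_wide` is hypothetical, and it assumes, as in the toy data, that the first column is the ID and every other column is named prefix_digit.

```r
# Hypothetical helper: TRUE if every suffix group holds the same
# prefixes in the same order (first column is assumed to be the ID)
is_regular_wide <- function(df) {
  vars <- names(df)[-1]
  prefix <- sub("_\\d$", "", vars)
  suffix <- sub(".*_(\\d)$", "\\1", vars)
  groups <- split(prefix, suffix)
  all(vapply(groups, identical, logical(1), groups[[1]]))
}

regular <- data.frame(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                      BB_1 = rep(3, 5), BB_2 = rep(6, 5))
irregular <- cbind(regular, CC_7 = 6:2)

is_regular_wide(regular)    # all suffix groups match
is_regular_wide(irregular)  # suffix 7 has only CC
```

If the check fails, the rbind step would not line the columns up, so one of the melt/dcast approaches is the safer choice.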
I've finally found a way by modifying my initial solution:
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1,
BB_2001=rep(3,5),BB_2002=rep(6,5),CC_2007=6:2)
idvars = grep("_20[0-9][0-9]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable),
measure = sub('.*_', '', variable), variable = NULL)]
temp[,var:=factor(var, levels=unique(var))]
dcast( temp, ... ~ var, value.var='value' )
This gives the proper measure values. However, this solution needs a lot of memory.
The trick was converting the var variable to a factor, specifying the order I want with levels, as mtoto did. mtoto's solution is nice because it needs only melt rather than melt and cast, but it doesn't work with my updated example; it only works when every word has the same number of numeric suffixes.
PS: I've been checking every step and found that the melt step can be a big problem when working with large data.tables. If you have a data.table with just 100000 rows x 1000 columns and use half of the columns as id.vars, the molten output is approx. 50000000 rows x 500 columns, far too much to continue with the next step. data.table needs a direct way to do it without creating giant middle steps.
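The size estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 4 bytes per cell is an assumption (it matches integer columns; doubles would need 8 bytes), and the two extra columns are the variable and value columns that melt() adds.

```r
n_rows <- 100000
n_cols <- 1000
n_id <- n_cols / 2            # half of the columns are id.vars
n_measure <- n_cols - n_id    # the other half are measure.vars

# melt() produces one output row per input row and measure column
molten_rows <- n_rows * n_measure
# every id column is repeated, plus `variable` and `value`
molten_cols <- n_id + 2

# Assumption: 4 bytes per cell
approx_gib <- molten_rows * molten_cols * 4 / 1024^3
```

Under these assumptions the intermediate table has 5e7 rows and 502 columns, which comes to more than 90 GiB.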
An alternative approach with data.table:
melt(mydata, id = 'ID')[, c("variable", "measure") := tstrsplit(variable, '_')
][, variable := factor(variable, levels = unique(variable))
][, dcast(.SD, ID + measure ~ variable, value.var = 'value')]
which gives:
    ID measure ZA BB CC
 1:  1       1  1  3 NA
 2:  1       2  5  6 NA
 3:  1       7 NA NA  6
 4:  2       1  2  3 NA
 5:  2       2  4  6 NA
 6:  2       7 NA NA  5
 7:  3       1  3  3 NA
 8:  3       2  3  6 NA
 9:  3       7 NA NA  4
10:  4       1  4  3 NA
11:  4       2  2  6 NA
12:  4       7 NA NA  3
13:  5       1  5  3 NA
14:  5       2  1  6 NA
15:  5       7 NA NA  2
You can melt several columns simultaneously if you pass a list of column names to the argument measure =. One approach to do this in a scalable manner would be to:
Extract the column names and the corresponding first two letters:
measurevars <- names(mydata)[grepl("_[1-9]$",names(mydata))]
groups <- gsub("_[1-9]$","",measurevars)
Turn groups into a factor object and make sure the levels aren't ordered alphabetically. We'll use this in the next step to create a list object with the correct structure.
split_on <- factor(groups, levels = unique(groups))
Create a list from measurevars with split(), and create a vector for the value.name = argument in melt().
measure_list <- split(measurevars, split_on)
measurenames <- unique(groups)
Bringing it all together:
melt(setDT(mydata),
measure = measure_list,
value.name = measurenames,
variable.name = "measure")
# ID measure ZA BB
# 1: 1 1 1 3
# 2: 2 1 2 3
# 3: 3 1 3 3
# 4: 4 1 4 3
# 5: 5 1 5 3
# 6: 1 2 5 6
# 7: 2 2 4 6
# 8: 3 2 3 6
# 9: 4 2 2 6
#10: 5 2 1 6
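For comparison, the same list of measure columns can be built with patterns(), which was mentioned in another answer. The sketch below assumes the toy data from the question (with integer values) and that the prefixes ZA and BB are known in advance, so unlike the steps above it does not discover the groups automatically.

```r
library(data.table)

mydata <- data.table(ID = 1:5, ZA_1 = 1:5, ZA_2 = 5:1,
                     BB_1 = rep(3L, 5), BB_2 = rep(6L, 5))

# One regular expression per output column; each one selects the
# input columns that are stacked into that output column
long <- melt(mydata, id.vars = "ID",
             measure.vars = patterns("^ZA_", "^BB_"),
             value.name = c("ZA", "BB"),
             variable.name = "measure")
```

Note that the measure column holds the position of each column within its pattern (1, 2, ...), which happens to coincide with the name suffix for this data.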