Given an arbitrary list of column names in a data.table, I want to concatenate the contents of those columns into a single string stored in a new column. The column
I don't know how representative the sample data is of your actual data, but for the sample data you can get a substantial performance improvement by concatenating each unique combination of ConcatCols only once, rather than once for every row that repeats it.
For the sample data, that means roughly 500k concatenations instead of 10 million.
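Whether this pays off depends on how many repeated combinations your own data contains. A quick way to check is to compare the number of distinct combinations with the number of rows. The sketch below uses a small made-up table (the column names region and year and the sizes are assumptions for illustration, not your data):

```r
library(data.table)

# Hypothetical data: 10,000 rows but at most 3 * 4 = 12 distinct combinations.
set.seed(42)
DT <- data.table(
  region = sample(c("N", "S", "E"), 10000, replace = TRUE),
  year   = sample(2001:2004, 10000, replace = TRUE)
)
ConcatCols <- c("region", "year")

n_rows   <- nrow(DT)
n_unique <- uniqueN(DT, by = ConcatCols)  # distinct combinations of ConcatCols

# A ratio near 1 means almost every row is distinct, so deduplicating saves little;
# a ratio near 0 means most of the paste work would be repeated.
dup_ratio <- n_unique / n_rows
```

If the ratio is close to 1, pasting every row directly is probably just as fast, since the extra unique and join steps have their own cost.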
See the following code and timing example:
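system.time({
# Sort DT by the columns to concatenate
setkeyv(DT, ConcatCols)
# Keep one row per unique combination of those columns
DTunique <- unique(DT[, ConcatCols, with=FALSE], by = key(DT))
# Paste each unique combination once
DTunique[, State := do.call(paste, c(DTunique, sep = ""))]
# Join the result back onto every matching row of DT
DT[DTunique, State := i.State, on = ConcatCols]
})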
# user system elapsed
# 7.448 0.462 4.618
About half the time is spent on the setkey part. If your data is already keyed, the time drops further to just over 2 seconds:
setkeyv(DT, ConcatCols)
system.time({
DTunique <- unique(DT[, ConcatCols, with=FALSE], by = key(DT))
DTunique[, State := do.call(paste, c(DTunique, sep = ""))]
DT[DTunique, State := i.State, on = ConcatCols]
})
# user system elapsed
# 2.526 0.280 2.181
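To convince yourself that the deduplicated approach gives the same result as pasting every row, here is a small self-contained sketch. The toy table and its column names (region, year, kind) are made up for illustration, and it uses .SD with .SDcols instead of referring to DTunique inside its own call, which is a stylistic choice rather than a requirement:

```r
library(data.table)

# Hypothetical toy data: three columns with many repeated combinations.
set.seed(1)
DT <- data.table(
  region = sample(c("N", "S", "E"), 1000, replace = TRUE),
  year   = sample(1:4, 1000, replace = TRUE),
  kind   = sample(c("x", "y"), 1000, replace = TRUE)
)
ConcatCols <- c("region", "year", "kind")

# Reference result: paste every row directly.
direct <- do.call(paste, c(DT[, ConcatCols, with = FALSE], sep = ""))

# Deduplicated approach: paste each unique combination once, then join back.
DTunique <- unique(DT[, ConcatCols, with = FALSE])
DTunique[, State := do.call(paste, c(.SD, sep = "")), .SDcols = ConcatCols]
DT[DTunique, State := i.State, on = ConcatCols]

# Both approaches should agree row by row.
stopifnot(identical(DT$State, direct))
```

This version skips setkeyv entirely and lets unique and the on = join work on unkeyed data; whether that is faster than keying first on your data is something to measure rather than assume.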