Fast concatenation of data.table columns into one string column

挽巷 2021-01-31 18:35

Given an arbitrary list of column names in a data.table, I want to concatenate the contents of those columns into a single string stored in a new column. The column

3 answers
  •  旧巷少年郎
    2021-01-31 18:46

    I don't know how representative the sample data is of your actual data, but for the sample data you can get a substantial performance improvement by concatenating each unique combination of ConcatCols only once, instead of once for every row in which it occurs.

    For the sample data, that means roughly 500k concatenations instead of the 10 million you would do if every duplicate were pasted as well.
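
    Whether this pays off depends on how many distinct combinations your own data has. One quick way to check is to compare uniqueN() with nrow(). The sketch below uses a small made-up table, so the DT and ConcatCols in it are hypothetical stand-ins rather than the data from the question:

    ```r
    library(data.table)

    # Toy stand-ins for the real DT and ConcatCols (hypothetical data)
    set.seed(1)
    DT <- data.table(
      x = sample(1:5, 1000, replace = TRUE),
      y = sample(letters[1:4], 1000, replace = TRUE),
      z = runif(1000)
    )
    ConcatCols <- c("x", "y")

    n_total  <- nrow(DT)                       # rows a plain paste would process
    n_unique <- uniqueN(DT, by = ConcatCols)   # distinct combinations of the columns
    n_unique / n_total                         # the smaller this ratio, the bigger the saving
    ```

    If the ratio is close to 1, nearly every row is unique and deduplicating first is unlikely to help.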

    See the following code and timing example:

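    system.time({
      # Sort DT by the columns to be concatenated
      setkeyv(DT, ConcatCols)
      # Keep one row per distinct combination of those columns
      DTunique <- unique(DT[, ConcatCols, with = FALSE], by = key(DT))
      # Paste each distinct combination once
      DTunique[, State := do.call(paste, c(DTunique, sep = ""))]
      # Update join: copy the pasted string back onto every matching row of DT
      DT[DTunique, State := i.State, on = ConcatCols]
    })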
    #       user      system     elapsed 
    #      7.448       0.462       4.618 
    

    About half of that time is spent in setkeyv. If your data is already keyed, the elapsed time drops further, to just over 2 seconds:

    # Keying is done outside the timed block here
    setkeyv(DT, ConcatCols)
    system.time({
      DTunique <- unique(DT[, ConcatCols, with = FALSE], by = key(DT))
      DTunique[, State := do.call(paste, c(DTunique, sep = ""))]
      DT[DTunique, State := i.State, on = ConcatCols]
    })
    #       user      system     elapsed 
    #      2.526       0.280       2.181 
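
    To convince yourself that the deduplicated route gives the same strings as pasting every row, here is a minimal self-contained sketch on a tiny made-up table (the column names a, b, v and the values are assumptions for illustration only). It skips setkeyv, so the row order of DT is left untouched, and it passes the columns through .SD instead of referring to DTunique by name:

    ```r
    library(data.table)

    # Hypothetical toy data standing in for the real DT
    set.seed(42)
    DT <- data.table(
      a = sample(1:3, 20, replace = TRUE),
      b = sample(c("x", "y"), 20, replace = TRUE),
      v = rnorm(20)
    )
    ConcatCols <- c("a", "b")

    # Reference result: paste every row directly
    expected <- do.call(paste, c(DT[, ConcatCols, with = FALSE], sep = ""))

    # Deduplicated route: paste once per distinct combination, then join back
    DTunique <- unique(DT[, ConcatCols, with = FALSE])
    DTunique[, State := do.call(paste, c(.SD, sep = "")), .SDcols = ConcatCols]
    DT[DTunique, State := i.State, on = ConcatCols]

    # Both routes should agree row by row
    stopifnot(identical(DT$State, expected))
    ```

    On a table this small there is no speed difference to measure; the point is only to check the result before running it on the full data.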
    
