问题
I frequently use recycling in data.table, for exemple when I need to make projections future years. I repeat my original data fro each future year.
This can lead to something like that :
library(data.table)
dt <- data.table(cbind(1:500000, 500000:1))
dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]
But I often have to deal with millions of lines, and far more columns than in this toy exemple. The time increases .. Try this :
library(data.table)
dt <- data.table(cbind(1:50000000, 50000000:1))
dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]
My question is : is there a more efficient to achieve this purpose ?
Thanks for any help !
EDIT : the accepted answer was the most complete (till now), for this formulation of the problem, but I realised that my issue is a little bit more tricky. I will ask another question in order to show it : data.table efficient recycling V2
回答1:
I'm benchmarking the solutions given so far against my own (which simply uses lapply
and rbindlist
). I couldn't run the entire task because I run out of memory. That's why I choose a smaller dt:
library(data.table)
dt <- data.table(cbind(1:5000000, 5000000:1))
original <- function() {
dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]
dt2
}
sb <- function() {
dt2 <- dt[CJ(V1, year = 1:10), on = "V1"]
}
gregor <- function() {
CJDT <- function(...) {
Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))
}
years = data.table(year = 1:10, key = "year")
setkey(dt)
dt3 = CJDT(dt, years)
dt3
}
bindlist <- function() {
dt3 <- rbindlist(lapply(1:10, function(x) {
dt$year <- x
dt
}))
# dt3 <- setcolorder(dt3, c("nrow", "V1", "V2", "year")) # to get exactly same dt
# dt3 <- dt3[order(nrow)]
dt3
}
Benchmark
library(bench)
res <- mark(
original = original(),
sb = sb(),
gregor = gregor(),
bindlist = bindlist(),
iterations = 1,
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
res
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 original 5.88s 5.88s 0.170 1.72GB 16.0
#> 2 sb 1.76s 1.76s 0.570 1.73GB 0.570
#> 3 gregor 1.87s 1.87s 0.536 972.86MB 0
#> 4 bindlist 558.69ms 558.69ms 1.79 1.12GB 0
summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 original 10.5 10.5 1 1.81 Inf
#> 2 sb 3.14 3.14 3.35 1.82 Inf
#> 3 gregor 3.34 3.34 3.15 1 NaN
#> 4 bindlist 1 1 10.5 1.18 NaN
Created on 2019-12-03 by the reprex package (v0.3.0)
Now the results are not exactly the same (see commented code in my solution for correcting it) but equivalent to what you are trying to do. My lapply
plus rbindlist
solution is suprisingly the fastet by a factor of more than 3. This might change on the full task but I doubt it.
回答2:
I would think of this problem as a cross join. There is no built-in way to do cross joins between two data tables (the CJ
function works on vectors), but from the discussion on this issue, this function works well:
CJDT <- function(...) {
Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))
}
Using your large example, this works for me:
years = data.table(year = 1:10, key = "year")
setkey(dt)
dt3 = CJDT(dt, years)
Where your method takes longer before running out of memory.
回答3:
As mentioned in comments I have a suspicion that the premise of the question might be questionable. In any case, here is a slightly faster alternative:
setkey(dt)
dt[CJ(V1, year = 1:10)]
Benchmarking:
dt <- data.table(cbind(1:50000000, 50000000:1))
microbenchmark::microbenchmark(
op = dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ],
sb = {setkey(dt); dt[CJ(V1, year = 1:10)]},
gr = {setkey(dt); CJDT(dt, data.table(year = 1:10, key = "year"))},
times = 1
)
Unit: seconds
expr min lq mean median uq max neval
op 171.67241 171.67241 171.67241 171.67241 171.67241 171.67241 1
sb 136.00782 136.00782 136.00782 136.00782 136.00782 136.00782 1
gr 45.14151 45.14151 45.14151 45.14151 45.14151 45.14151 1
来源:https://stackoverflow.com/questions/59159729/data-table-efficient-recycling