From the question: "I have a dataset with around 25 million rows. I am taking a subset of these rows and performing a function which works fine. However, what I then need to do is update the values [...]"
The answer provided by David Arenburg in his comment explains how to join the subset of modified data back into the original data.table. However, I wonder why the OP doesn't apply the changes directly to the original data.table by reference, using a function which returns a list:
my_fun <- function(alloc, assig) {
  list(alloc + 1,   # new value for ALLOCATED
       "B")         # new value for ASSIGNED (recycled across all rows)
}
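The OP's data creation is not shown, so the examples below assume a data.frame df with the four columns visible in the printed output; a minimal hypothetical setup might look like this (the AREA_CD values printed further down come from the OP's actual data, not from this sketch):
library(data.table)
set.seed(1)
n  <- 2.5e6                                    # 2.5 million rows (see benchmark note below)
df <- data.frame(AREA_CD   = sample(20000L, n, replace = TRUE),
                 ALLOCATED = 0L,
                 ASSIGNED  = "A",
                 ID_CD     = paste0("ID", seq_len(n)),   # unique row ids, needed for the join
                 stringsAsFactors = FALSE)
dt <- as.data.table(df)                        # dt is a copy; df stays a data.frame for now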
With this function, the subset of rows can be updated directly within the data.table:
dt[5:2004, c("ALLOCATED", "ASSIGNED") := my_fun(ALLOCATED, ASSIGNED)]
dt[1:7]
#    AREA_CD ALLOCATED ASSIGNED ID_CD
# 1:    1944         0        A   ID1
# 2:    3265         0        A   ID2
# 3:   15415         0        A   ID3
# 4:   14121         0        A   ID4
# 5:   10546         1        B   ID5
# 6:    2263         1        B   ID6
# 7:   12339         1        B   ID7
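For completeness, the same update can also be written without a helper function by building the list inline with `:=` (equivalent to the call above; the length-one "B" is recycled across all 2000 rows of the subset):
dt[5:2004, `:=`(ALLOCATED = ALLOCATED + 1, ASSIGNED = "B")]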
Due to memory limitations, a smaller data set of 2.5 million rows (instead of the 25 million in the OP) is used for the benchmark.
library(microbenchmark)
setDT(df)   # coerce df to a data.table by reference
microbenchmark(
  copy = dt <- copy(df),
  join = {
    dt <- copy(df)
    sub_dt <- dt[5:2004, ]   # subsetting creates a new data.table
    sub_dt[, ALLOCATED := ALLOCATED + 1]
    sub_dt[, ASSIGNED := "B"]
    # update join: pull the modified values back into dt, matching on ID_CD
    dt[sub_dt, `:=`(ALLOCATED = i.ALLOCATED, ASSIGNED = i.ASSIGNED), on = .(ID_CD)]
  },
  byref = {
    dt <- copy(df)
    dt[5:2004, c("ALLOCATED", "ASSIGNED") := my_fun(ALLOCATED, ASSIGNED)]
  },
  times = 10L
)
# Unit: milliseconds
#   expr       min        lq      mean    median        uq       max neval
#   copy  13.80400  14.07850  28.22882  14.15836  14.39643 154.70570    10
#   join 239.36476 240.72745 244.27668 243.52967 246.17104 255.06271    10
#  byref  14.28806  14.47308  15.00056  14.63147  14.73134  18.71181    10
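As a sanity check (not part of the original benchmark), one can verify that both approaches produce the same result; data.table's all.equal method compares the tables' contents and should return TRUE here:
dt_join <- copy(df)
sub_dt  <- dt_join[5:2004, ]
sub_dt[, `:=`(ALLOCATED = ALLOCATED + 1, ASSIGNED = "B")]
dt_join[sub_dt, `:=`(ALLOCATED = i.ALLOCATED, ASSIGNED = i.ASSIGNED), on = .(ID_CD)]

dt_ref <- copy(df)
dt_ref[5:2004, c("ALLOCATED", "ASSIGNED") := my_fun(ALLOCATED, ASSIGNED)]

all.equal(dt_join, dt_ref)   # expected: TRUE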
Updating the data.table "in place" is much faster than creating a subset and joining it back in afterwards. The copy operation is required so that every benchmark run starts with an unmodified version of dt; therefore, the copy operation is benchmarked as well, as a baseline.
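The speed difference comes from := modifying the existing object in place rather than building a new one. This can be illustrated with data.table's address(), which returns the memory address of an object:
dt <- copy(df)
address(dt)   # address of dt before the update
dt[5:2004, c("ALLOCATED", "ASSIGNED") := my_fun(ALLOCATED, ASSIGNED)]
address(dt)   # same address afterwards: dt was updated by reference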
data.table version 1.10.4 was used.