I'm trying to enrich one dataset (adherence) based on subsets from another (lsr). For each individual row in adherence, I want to calculate (as a third column) the medicati
This can be solved by updating in a non-equi join.
This avoids the memory issues caused by a cartesian join or by calling apply(), which coerces a data.frame or data.table to a matrix and thereby copies the data.
In addition, the OP has mentioned that lsr has a few hundred million rows and adherence has 1.5 million rows (500 time periods times 3000 IDs). Therefore, storing data items efficiently will not only reduce the memory footprint but may also reduce the share of processing time spent loading data.
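To illustrate the storage point, here is a minimal sketch comparing base R's Date class with data.table's IDate class. The vector length and the variable names d and id are arbitrary choices for illustration, and the exact sizes reported will depend on the platform.

```r
library(data.table)

# one million consecutive dates; base R Date values are stored as doubles
d <- as.Date("2013-01-01") + 0:999999
# IDate stores the same dates as integers
id <- as.IDate(d)

typeof(d)   # storage mode of Date
typeof(id)  # storage mode of IDate

# compare the memory footprint of both vectors
print(object.size(d), units = "MB")
print(object.size(id), units = "MB")
```

With many date columns and hundreds of millions of rows, halving the storage per date value can add up to a noticeable difference.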
library(data.table)
# coerce to data.table by reference, i.e., without copying
setDT(adherence)
setDT(lsr)
# coerce to IDate to save memory
adherence[, year := as.IDate(year)]
cols <- c("eksd", "ENDDATE")
lsr[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# update in a non-equi join
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE),
          AH := as.integer(ENDDATE - x.year)][]
   ID       year AH
1:  1 2013-01-01 NA
2:  2 2013-01-01 NA
3:  3 2013-01-01 NA
4:  1 2013-02-01 64
5:  2 2013-02-01 NA
6:  3 2013-02-01 63
Note that NA indicates that no match was found. If required, the AH column can be initialised before the non-equi join by adherence[, AH := 0L].
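As a self-contained sketch of that variant, the following builds the sample data directly as data.tables (a shortcut taken here for brevity, not the OP's original construction), initialises AH with 0L, and then runs the same update in a non-equi join. Only rows with a match are overwritten, so unmatched rows keep the initial value.

```r
library(data.table)

# sample data from this answer, created directly as data.tables with IDate columns
adherence <- data.table(
  ID = rep(c("1", "2", "3"), times = 2),
  year = as.IDate(rep(c("2013-01-01", "2013-02-01"), each = 3)))
lsr <- data.table(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.IDate(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                    "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = c(60L, 90L, 90L, 60L, 120L, 60L, 30L, 90L))
lsr[, ENDDATE := eksd + DDD]

# initialise AH so that rows without a match show 0 instead of NA
adherence[, AH := 0L]

# update in a non-equi join; only matching rows are overwritten
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE),
          AH := as.integer(ENDDATE - x.year)]
print(adherence)
```

Whether 0 or NA is the better marker for "no match" depends on how AH is used downstream: NA keeps "no active prescription" distinguishable from a computed value.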
The code to create the sample datasets can be streamlined:
adherence <- data.frame(
ID = c("1", "2", "3", "1", "2", "3"),
year = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01")),
stringsAsFactors = FALSE)
lsr <- data.frame(
ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
DDD = c(60L, 90L, 90L, 60L, 120L, 60L, 30L, 90L),
stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD
Note that DDD is of type integer, which usually requires 4 bytes per value instead of 8 bytes for type numeric/double.
Also note that the last statement may cause the whole data object lsr to be copied. This can be avoided by using data.table syntax, which updates by reference.
library(data.table)
setDT(lsr)[, ENDDATE := eksd + DDD][]
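One way to check the update-by-reference claim is to compare the memory address of the data.table before and after adding the column, using data.table's address() function. This is a rough sketch on a two-row subset of the sample data; the variable names addr_before, addr_after, and same_object are made up for illustration.

```r
library(data.table)

# small subset of the sample data, enough to show the effect
lsr <- data.frame(
  ID = c("1", "2"),
  eksd = as.Date(c("2012-03-01", "2012-08-25")),
  DDD = c(60L, 60L),
  stringsAsFactors = FALSE)
setDT(lsr)

# address of the data.table before and after adding a column with :=
addr_before <- address(lsr)
lsr[, ENDDATE := eksd + DDD]
addr_after <- address(lsr)

# expected to be TRUE when the column was added in place
same_object <- identical(addr_before, addr_after)
print(same_object)
```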