While preparing an answer to the question dplyr or data.table to calculate time series aggregations in R I noticed that I do get different results depending on whether the table
The grouping by=.EACHI means "by each i", not "by each x".
# for readability / my sanity
DT = copy(DT0)
setnames(DT, "hospitalization.date", "h.date")

# by=.EACHI puts the join columns in front; tail.default(.SD, 6) keeps only the last 6 columns
z = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
  .(x.h.date, patient.id, i.start.date, i.end.date, g = .GRP, .N),
  by = .EACHI][, utils:::tail.default(.SD, 6)]
      x.h.date patient.id i.start.date i.end.date g N
 1: 2013-10-15          1   2012-10-15 2013-10-15 1 1 *
 2: 2015-07-16          1   2014-07-16 2015-07-16 2 1
 3: 2015-07-16          1   2015-01-07 2016-01-07 3 2 *
 4: 2016-01-07          1   2015-01-07 2016-01-07 3 2 *
 5: 2014-10-15          2   2013-10-15 2014-10-15 4 1 *
 6: 2015-12-20          2   2014-12-20 2015-12-20 5 1
 7: 2015-12-20          2   2014-12-25 2015-12-25 6 2
 8: 2015-12-25          2   2014-12-25 2015-12-25 6 2
 9: 2015-12-20          2   2015-02-10 2016-02-10 7 3 *
10: 2015-12-25          2   2015-02-10 2016-02-10 7 3 *
11: 2016-02-10          2   2015-02-10 2016-02-10 7 3 *
For patient 1, the groups are

- .(start.date = 2012-10-15, end.date = 2013-10-15), count of 1
- .(start.date = 2014-07-16, end.date = 2015-07-16), count of 1
- .(start.date = 2015-01-07, end.date = 2016-01-07), count of 2

It is just by luck that there are both seven groups in this join and seven rows in the original table.
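To see the "by each i" behaviour in isolation, here is a minimal sketch on two toy tables. The tables x and i and their columns are made up for illustration and are not part of the OP's data.

```r
library(data.table)

x = data.table(id = c(1L, 1L, 2L), v = 1:3)  # table being looked up
i = data.table(id = c(1L, 2L, 3L))           # table driving the join

# by=.EACHI makes one group, and so one result row, per row of i.
# id 3 has no match in x, but it still gets a row (with a count of 0).
res = x[i, on = .(id), .N, by = .EACHI]
print(res)
```

The result has three rows, one per row of i, with counts 2, 1 and 0. The number of rows in x plays no role in how many groups there are.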
For the tougher issue, I'll borrow an example from my notes:
Beware multiple matches in an update join. When there are multiple matches, an update join will apparently only use the last one. Unfortunately, this is done silently. Try:
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))
b[a, on=.(id), x := i.x, verbose = TRUE ][]
# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: x,i.x
# Assigning to 3 row subset of 2 rows
#    id  y  x
# 1:  1 11 12
# 2:  2 15 13
With verbose on, we see a helpful message about assignment “to 3 row subset of 2 rows.”
-- modified from "Quick R Tutorial", section "Updating in a join"
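Since the overwrite is silent, one way to spot it beforehand is to count the candidate matches per target row. This is only a sketch of one possible check, reusing the tables a and b from the quoted example; the names matches and ambiguous are mine.

```r
library(data.table)

a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))

# For each row of b, count the rows of a that b[a, on=.(id), x := i.x] would assign to it
matches = a[b, on = .(id), .N, by = .EACHI]

# Rows of b that would receive more than one candidate value
ambiguous = matches[N > 1L]
print(ambiguous)
```

Here id 1 shows up with a count of 2, which is the row where the update join has to pick one of two values.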
In the OP's case, unfortunately, verbose=TRUE does not offer a comparable "Assigning to ..." message.
DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
n := .N, by = .EACHI, verbose=TRUE]
# Non-equi join operators detected ...
# forder took ... 0.01 secs
# Generating group lengths ... done in 0 secs
# Generating non-equi group ids ... done in 0 secs
# Found 1 non-equi group(s) ...
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: <none>
# lapply optimization is on, j unchanged as '.N'
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.000s for 7 groups
# eval(j) took 0.000s for 7 calls
# 0.01 secs
However, we can see that the last row per x group does contain the value the OP sees. I've manually marked these with asterisks above. Alternatively, you could mark them with z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date].
For reference, the update join here is...
DT[, n :=
.SD[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date), .N, by=.EACHI]$N
]
   patient.id hospitalization.date start.date   end.date     h.date n
1:          1           2013-10-15 2012-10-15 2013-10-15 2013-10-15 1
2:          1           2015-07-16 2014-07-16 2015-07-16 2015-07-16 1
3:          1           2016-01-07 2015-01-07 2016-01-07 2016-01-07 2
4:          2           2014-10-15 2013-10-15 2014-10-15 2014-10-15 1
5:          2           2015-12-20 2014-12-20 2015-12-20 2015-12-20 1
6:          2           2015-12-25 2014-12-25 2015-12-25 2015-12-25 2
7:          2           2016-02-10 2015-02-10 2016-02-10 2016-02-10 3
This is the correct and idiomatic way to add columns to x by looking up each row of x in another table and computing a summary of the result:
x[, v := DT2[.SD, on=, j, by=.EACHI]$V1 ]
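As a concrete instance of this template, here is a sketch that can be run from scratch. DT0 is rebuilt by hand from the output printed in this answer, so it is an assumption about the OP's data rather than the original object; the brute-force check and the name brute are mine. Because j is .N here, the result column is N rather than V1.

```r
library(data.table)

# Assumption: DT0 reconstructed from the printed tables, not the OP's original object
DT0 = data.table(
  patient.id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013-10-15", "2015-07-16", "2016-01-07",
                                   "2014-10-15", "2015-12-20", "2015-12-25", "2016-02-10")),
  start.date = as.Date(c("2012-10-15", "2014-07-16", "2015-01-07",
                         "2013-10-15", "2014-12-20", "2014-12-25", "2015-02-10"))
)
DT0[, end.date := hospitalization.date]

DT = copy(DT0)
setnames(DT, "hospitalization.date", "h.date")

# Template with x = DT, DT2 = DT, j = .N: one count per row of .SD, in row order
DT[, n := DT[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date),
             .N, by = .EACHI]$N]

# Independent brute-force count of the same windows, for comparison
brute = sapply(seq_len(nrow(DT)), function(k) {
  sum(DT$patient.id == DT$patient.id[k] &
      DT$h.date >= DT$start.date[k] &
      DT$h.date <= DT$end.date[k])
})
stopifnot(all(DT$n == brute))
print(DT)
```

The inner join returns exactly one row per row of .SD, so assigning its N column back with := lines up row for row, and there is no multiple-match ambiguity to resolve.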