While preparing an answer to the question dplyr or data.table to calculate time series aggregations in R I noticed that I do get different results depending on whether the table is updated in place or returned as a new object. Also, I do get different result when I change the order of columns in the non-equi join conditions.
Currently, I don't have an explanation for this, perhaps due to a major misunderstanding on my side or a simple coding error.
Please, note that this question is asking particularly for explanations of the observed behaviour of
data.table
joins. If you have alternative solutions to the underlying problem, please, feel free to post an answer to the original question.
Original question and working answer
The original question was how to count the number of hospitalizations occuring in the 365 days before that hospitalization (including the actual one) for each patient using these data:
library(data.table) # version 1.10.4 (CRAN) or 1.10.5 (devel built 2017-08-19)
DT0 <- data.table(
patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16", "2016/1/7",
"2015/12/20", "2015/12/25", "2016/2/10")))
setorder(DT0, patient.id, hospitalization.date)
DT0
patient.id hospitalization.date 1: 1 2013-10-15 2: 1 2015-07-16 3: 1 2016-01-07 4: 2 2014-10-15 5: 2 2015-12-20 6: 2 2015-12-25 7: 2 2016-02-10
The code below gives the expected answer (additional helper column added here for clarity)
# add helper columns
DT0[, start.date := hospitalization.date - 365][
, end.date := hospitalization.date][]
DT0
patient.id hospitalization.date start.date end.date 1: 1 2013-10-15 2012-10-15 2013-10-15 2: 1 2015-07-16 2014-07-16 2015-07-16 3: 1 2016-01-07 2015-01-07 2016-01-07 4: 2 2014-10-15 2013-10-15 2014-10-15 5: 2 2015-12-20 2014-12-20 2015-12-20 6: 2 2015-12-25 2014-12-25 2015-12-25 7: 2 2016-02-10 2015-02-10 2016-02-10
result <- DT0[DT0, on = c("patient.id", "hospitalization.date>=start.date",
"hospitalization.date<=end.date"),
.(hospitalizations.last.year = .N), by = .EACHI][]
result
patient.id hospitalization.date hospitalization.date hospitalizations.last.year 1: 1 2012-10-15 2013-10-15 1 2: 1 2014-07-16 2015-07-16 1 3: 1 2015-01-07 2016-01-07 2 4: 2 2013-10-15 2014-10-15 1 5: 2 2014-12-20 2015-12-20 1 6: 2 2014-12-25 2015-12-25 2 7: 2 2015-02-10 2016-02-10 3
except for the renamed and duplicated column names (which are left as is for comparison).
For patient.id == 2
, the result in last row is 3 because the patient was hospitalized on 2016-02-10 for the third time since 2015-02-10.
Update on join in place
result
is a new data.table
object which occupies additional memory. I tried to update the original data.table
object in place using:
# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "hospitalization.date>=start.date",
"hospitalization.date<=end.date"),
hospitalizations.last.year := .N, by = .EACHI]
DT
patient.id hospitalization.date start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2012-10-15 2013-10-15 1 2: 1 2015-07-16 2014-07-16 2015-07-16 2 3: 1 2016-01-07 2015-01-07 2016-01-07 2 4: 2 2014-10-15 2013-10-15 2014-10-15 1 5: 2 2015-12-20 2014-12-20 2015-12-20 3 6: 2 2015-12-25 2014-12-25 2015-12-25 3 7: 2 2016-02-10 2015-02-10 2016-02-10 3
DT
has now been updated in place but rows 5 and 6 show now 3 hospitalizations instead of 1 or 2, resp. It seems that now the total number of hospitalizations within the last period is returned for each of the rows.
Change the order of columns in the conditions.
Also the order of columns in the non-equi join conditions do matter, even in a self-join:
result <- DT0[DT0, on = c("patient.id", "start.date<=hospitalization.date",
"end.date>=hospitalization.date"),
.(hospitalizations.last.year = .N), by = .EACHI][]
result
My expectation was that "start.date<=hospitalization.date"
would be equivalent to "hospitalization.date>=start.date"
(note that also <
and >
were switched) but the result
patient.id start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2013-10-15 1 2: 1 2015-07-16 2015-07-16 2 3: 1 2016-01-07 2016-01-07 1 4: 2 2014-10-15 2014-10-15 1 5: 2 2015-12-20 2015-12-20 3 6: 2 2015-12-25 2015-12-25 2 7: 2 2016-02-10 2016-02-10 1
is different. It seems that now the number of forthcoming hospitalizations are being counted
Interestingly, the update in place does now return the same result (except for some of the column names):
# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "start.date<=hospitalization.date",
"end.date>=hospitalization.date"),
hospitalizations.last.year := .N, by = .EACHI]
DT
patient.id hospitalization.date start.date end.date hospitalizations.last.year 1: 1 2013-10-15 2012-10-15 2013-10-15 1 2: 1 2015-07-16 2014-07-16 2015-07-16 2 3: 1 2016-01-07 2015-01-07 2016-01-07 1 4: 2 2014-10-15 2013-10-15 2014-10-15 1 5: 2 2015-12-20 2014-12-20 2015-12-20 3 6: 2 2015-12-25 2014-12-25 2015-12-25 2 7: 2 2016-02-10 2015-02-10 2016-02-10 1
Related
There is a potentially related question which led to an issue reported on github.
There is an answer by Arun concerning usage of the x.
prefix with non-equi joins.
The grouping by=.EACHI
means "by each i" not "by each x".
# for readability / my sanity
DT = copy(DT0)
setnames(DT, "hospitalization.date", "h.date")
z = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
.(x.h.date, patient.id, i.start.date, i.end.date, g = .GRP, .N)
, by=.EACHI][, utils:::tail.default(.SD, 6)]
x.h.date patient.id i.start.date i.end.date g N
1: 2013-10-15 1 2012-10-15 2013-10-15 1 1 *
2: 2015-07-16 1 2014-07-16 2015-07-16 2 1
3: 2015-07-16 1 2015-01-07 2016-01-07 3 2 *
4: 2016-01-07 1 2015-01-07 2016-01-07 3 2 *
5: 2014-10-15 2 2013-10-15 2014-10-15 4 1 *
6: 2015-12-20 2 2014-12-20 2015-12-20 5 1
7: 2015-12-20 2 2014-12-25 2015-12-25 6 2
8: 2015-12-25 2 2014-12-25 2015-12-25 6 2
9: 2015-12-20 2 2015-02-10 2016-02-10 7 3 *
10: 2015-12-25 2 2015-02-10 2016-02-10 7 3 *
11: 2016-02-10 2 2015-02-10 2016-02-10 7 3 *
For patient 1, the groups are
.(start.date = 2012-10-15, end.date = 2013-10-15)
, count of 1.(start.date = 2014-07-16, end.date = 2015-07-16)
, count of 1.(start.date = 2015-01-07, end.date = 2016-01-07)
, count of 2
It is just by luck that there are both seven groups in this join and seven rows in the original table.
For the tougher issue, I'll borrow an example from my notes:
Beware multiple matches in an update join. When there are multiple matches, an update join will apparently only use the last one. Unfortunately, this is done silently. Try:
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15) b = data.table(id = 1:2, y = c(11L, 15L)) b[a, on=.(id), x := i.x, verbose = TRUE ][] # Calculated ad hoc index in 0 secs # Starting bmerge ...done in 0.02 secs # Detected that j uses these columns: x,i.x # Assigning to 3 row subset of 2 rows # id y x # 1: 1 11 12 # 2: 2 15 13
With verbose on, we see a helpful message about assignment “to 3 row subset of 2 rows.”
-- modified from "Quick R Tutorial", section "Updating in a join"
In the OP's case, verbose=TRUE
does not offer such a message, unfortunately.
DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
n := .N, by = .EACHI, verbose=TRUE]
# Non-equi join operators detected ...
# forder took ... 0.01 secs
# Generating group lengths ... done in 0 secs
# Generating non-equi group ids ... done in 0 secs
# Found 1 non-equi group(s) ...
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: <none>
# lapply optimization is on, j unchanged as '.N'
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.000s for 7 groups
# eval(j) took 0.000s for 7 calls
# 0.01 secs
However, we can see that the last row per x
group does contain the value the OP sees. I've manually marked these with asterisks above. Alternately, you could mark them with z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date]
.
For reference, the update join here is...
DT[, n :=
.SD[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date), .N, by=.EACHI]$N
]
patient.id hospitalization.date start.date end.date h.date n
1: 1 2013-10-15 2012-10-15 2013-10-15 2013-10-15 1
2: 1 2015-07-16 2014-07-16 2015-07-16 2015-07-16 1
3: 1 2016-01-07 2015-01-07 2016-01-07 2016-01-07 2
4: 2 2014-10-15 2013-10-15 2014-10-15 2014-10-15 1
5: 2 2015-12-20 2014-12-20 2015-12-20 2015-12-20 1
6: 2 2015-12-25 2014-12-25 2015-12-25 2015-12-25 2
7: 2 2016-02-10 2015-02-10 2016-02-10 2016-02-10 3
This is the correct/idiomatic way to handle this case, of adding columns to x
based on looking up each row of x
in another table and computing a summary of the result:
x[, v := DT2[.SD, on=, j, by=.EACHI]$V1 ]
来源:https://stackoverflow.com/questions/45792830/odd-behaviour-of-data-tables-update-on-non-equi-self-join