Odd behaviour of data.table's update on non-equi self-join

问题

While preparing an answer to the question dplyr or data.table to calculate time series aggregations in R I noticed that I do get different results depending on whether the table is updated in place or returned as a new object. Also, I do get different result when I change the order of columns in the non-equi join conditions.

Currently, I don't have an explanation for this, perhaps due to a major misunderstanding on my side or a simple coding error.

Please, note that this question is asking particularly for explanations of the observed behaviour of data.table joins. If you have alternative solutions to the underlying problem, please, feel free to post an answer to the original question.

Original question and working answer

The original question was how to count the number of hospitalizations occuring in the 365 days before that hospitalization (including the actual one) for each patient using these data:

library(data.table)   # version 1.10.4 (CRAN) or 1.10.5 (devel built 2017-08-19)
DT0 <- data.table(
  patient.id = c(1L, 2L, 1L, 1L, 2L, 2L, 2L),
  hospitalization.date = as.Date(c("2013/10/15", "2014/10/15", "2015/7/16", "2016/1/7", 
                                   "2015/12/20", "2015/12/25", "2016/2/10")))
setorder(DT0, patient.id, hospitalization.date)
DT0

   patient.id hospitalization.date
1:          1           2013-10-15
2:          1           2015-07-16
3:          1           2016-01-07
4:          2           2014-10-15
5:          2           2015-12-20
6:          2           2015-12-25
7:          2           2016-02-10

The code below gives the expected answer (additional helper column added here for clarity)

# add helper columns
DT0[, start.date := hospitalization.date - 365][
  , end.date := hospitalization.date][]
DT0

   patient.id hospitalization.date start.date   end.date
1:          1           2013-10-15 2012-10-15 2013-10-15
2:          1           2015-07-16 2014-07-16 2015-07-16
3:          1           2016-01-07 2015-01-07 2016-01-07
4:          2           2014-10-15 2013-10-15 2014-10-15
5:          2           2015-12-20 2014-12-20 2015-12-20
6:          2           2015-12-25 2014-12-25 2015-12-25
7:          2           2016-02-10 2015-02-10 2016-02-10

result <- DT0[DT0, on = c("patient.id", "hospitalization.date>=start.date", 
              "hospitalization.date<=end.date"), 
   .(hospitalizations.last.year = .N), by = .EACHI][]
result

   patient.id hospitalization.date hospitalization.date hospitalizations.last.year
1:          1           2012-10-15           2013-10-15                          1
2:          1           2014-07-16           2015-07-16                          1
3:          1           2015-01-07           2016-01-07                          2
4:          2           2013-10-15           2014-10-15                          1
5:          2           2014-12-20           2015-12-20                          1
6:          2           2014-12-25           2015-12-25                          2
7:          2           2015-02-10           2016-02-10                          3

except for the renamed and duplicated column names (which are left as is for comparison).

For patient.id == 2, the result in last row is 3 because the patient was hospitalized on 2016-02-10 for the third time since 2015-02-10.

Update on join in place

result is a new data.table object which occupies additional memory. I tried to update the original data.table object in place using:

# use copy of DT0 which can be safely modified
DT <- copy(DT0)

DT[DT, on = c("patient.id", "hospitalization.date>=start.date", 
            "hospitalization.date<=end.date"), 
   hospitalizations.last.year := .N, by = .EACHI]
DT

   patient.id hospitalization.date start.date   end.date hospitalizations.last.year
1:          1           2013-10-15 2012-10-15 2013-10-15                          1
2:          1           2015-07-16 2014-07-16 2015-07-16                          2
3:          1           2016-01-07 2015-01-07 2016-01-07                          2
4:          2           2014-10-15 2013-10-15 2014-10-15                          1
5:          2           2015-12-20 2014-12-20 2015-12-20                          3
6:          2           2015-12-25 2014-12-25 2015-12-25                          3
7:          2           2016-02-10 2015-02-10 2016-02-10                          3

DThas now been updated in place but rows 5 and 6 show now 3 hospitalizations instead of 1 or 2, resp. It seems that now the total number of hospitalizations within the last period is returned for each of the rows.

Change the order of columns in the conditions.

Also the order of columns in the non-equi join conditions do matter, even in a self-join:

result <- DT0[DT0, on = c("patient.id", "start.date<=hospitalization.date", 
                          "end.date>=hospitalization.date"), 
              .(hospitalizations.last.year = .N), by = .EACHI][]
result

My expectation was that "start.date<=hospitalization.date" would be equivalent to "hospitalization.date>=start.date" (note that also < and > were switched) but the result

   patient.id start.date   end.date hospitalizations.last.year
1:          1 2013-10-15 2013-10-15                          1
2:          1 2015-07-16 2015-07-16                          2
3:          1 2016-01-07 2016-01-07                          1
4:          2 2014-10-15 2014-10-15                          1
5:          2 2015-12-20 2015-12-20                          3
6:          2 2015-12-25 2015-12-25                          2
7:          2 2016-02-10 2016-02-10                          1

is different. It seems that now the number of forthcoming hospitalizations are being counted

Interestingly, the update in place does now return the same result (except for some of the column names):

# use copy of DT0 which can be safely modified
DT <- copy(DT0)
DT[DT, on = c("patient.id", "start.date<=hospitalization.date", 
              "end.date>=hospitalization.date"), 
   hospitalizations.last.year := .N, by = .EACHI]
DT

   patient.id hospitalization.date start.date   end.date hospitalizations.last.year
1:          1           2013-10-15 2012-10-15 2013-10-15                          1
2:          1           2015-07-16 2014-07-16 2015-07-16                          2
3:          1           2016-01-07 2015-01-07 2016-01-07                          1
4:          2           2014-10-15 2013-10-15 2014-10-15                          1
5:          2           2015-12-20 2014-12-20 2015-12-20                          3
6:          2           2015-12-25 2014-12-25 2015-12-25                          2
7:          2           2016-02-10 2015-02-10 2016-02-10                          1

There is a potentially related question which led to an issue reported on github.

There is an answer by Arun concerning usage of the x. prefix with non-equi joins.

回答1:

The grouping by=.EACHI means "by each i" not "by each x".

# for readability / my sanity
DT = copy(DT0)
setnames(DT, "hospitalization.date", "h.date")

z = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
   .(x.h.date, patient.id, i.start.date, i.end.date, g = .GRP, .N)
, by=.EACHI][, utils:::tail.default(.SD, 6)]

      x.h.date patient.id i.start.date i.end.date g N
 1: 2013-10-15          1   2012-10-15 2013-10-15 1 1 * 
 2: 2015-07-16          1   2014-07-16 2015-07-16 2 1 
 3: 2015-07-16          1   2015-01-07 2016-01-07 3 2 *
 4: 2016-01-07          1   2015-01-07 2016-01-07 3 2 *
 5: 2014-10-15          2   2013-10-15 2014-10-15 4 1 *  
 6: 2015-12-20          2   2014-12-20 2015-12-20 5 1
 7: 2015-12-20          2   2014-12-25 2015-12-25 6 2  
 8: 2015-12-25          2   2014-12-25 2015-12-25 6 2 
 9: 2015-12-20          2   2015-02-10 2016-02-10 7 3 *
10: 2015-12-25          2   2015-02-10 2016-02-10 7 3 *
11: 2016-02-10          2   2015-02-10 2016-02-10 7 3 *

For patient 1, the groups are

.(start.date = 2012-10-15, end.date = 2013-10-15), count of 1
.(start.date = 2014-07-16, end.date = 2015-07-16), count of 1
.(start.date = 2015-01-07, end.date = 2016-01-07), count of 2

It is just by luck that there are both seven groups in this join and seven rows in the original table.

For the tougher issue, I'll borrow an example from my notes:

Beware multiple matches in an update join. When there are multiple matches, an update join will apparently only use the last one. Unfortunately, this is done silently. Try:
a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), 
  t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))
b[a, on=.(id), x := i.x, verbose = TRUE ][]

# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: x,i.x 
# Assigning to 3 row subset of 2 rows
#    id  y  x
# 1:  1 11 12
# 2:  2 15 13
With verbose on, we see a helpful message about assignment “to 3 row subset of 2 rows.”

-- modified from "Quick R Tutorial", section "Updating in a join"

In the OP's case, verbose=TRUE does not offer such a message, unfortunately.

DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
   n := .N, by = .EACHI, verbose=TRUE]
# Non-equi join operators detected ... 
#   forder took ... 0.01 secs
#   Generating group lengths ... done in 0 secs
#   Generating non-equi group ids ... done in 0 secs
#   Found 1 non-equi group(s) ...
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: <none> 
# lapply optimization is on, j unchanged as '.N'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.000s for 7 groups
#   eval(j) took 0.000s for 7 calls
# 0.01 secs

However, we can see that the last row per x group does contain the value the OP sees. I've manually marked these with asterisks above. Alternately, you could mark them with z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date].

For reference, the update join here is...

DT[, n := 
  .SD[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date), .N, by=.EACHI]$N 
]

   patient.id hospitalization.date start.date   end.date     h.date n
1:          1           2013-10-15 2012-10-15 2013-10-15 2013-10-15 1
2:          1           2015-07-16 2014-07-16 2015-07-16 2015-07-16 1
3:          1           2016-01-07 2015-01-07 2016-01-07 2016-01-07 2
4:          2           2014-10-15 2013-10-15 2014-10-15 2014-10-15 1
5:          2           2015-12-20 2014-12-20 2015-12-20 2015-12-20 1
6:          2           2015-12-25 2014-12-25 2015-12-25 2015-12-25 2
7:          2           2016-02-10 2015-02-10 2016-02-10 2016-02-10 3

This is the correct/idiomatic way to handle this case, of adding columns to x based on looking up each row of x in another table and computing a summary of the result:

x[, v := DT2[.SD, on=, j, by=.EACHI]$V1 ]

来源：https://stackoverflow.com/questions/45792830/odd-behaviour-of-data-tables-update-on-non-equi-self-join

标签

join

data.table

self-join

Odd behaviour of data.table's update on non-equi self-join

问题

Original question and working answer

Update on join in place

Change the order of columns in the conditions.

Related

回答1: