Odd behaviour of data.table's update on non-equi self-join

元气小坏坏 提交于 2019-12-04 11:28:00

The grouping by=.EACHI means "by each i" not "by each x".

# for readability / my sanity
DT = copy(DT0)
setnames(DT, "hospitalization.date", "h.date")

z = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
   .(x.h.date, patient.id, i.start.date, i.end.date, g = .GRP, .N)
, by=.EACHI][, utils:::tail.default(.SD, 6)]

      x.h.date patient.id i.start.date i.end.date g N
 1: 2013-10-15          1   2012-10-15 2013-10-15 1 1 * 
 2: 2015-07-16          1   2014-07-16 2015-07-16 2 1 
 3: 2015-07-16          1   2015-01-07 2016-01-07 3 2 *
 4: 2016-01-07          1   2015-01-07 2016-01-07 3 2 *
 5: 2014-10-15          2   2013-10-15 2014-10-15 4 1 *  
 6: 2015-12-20          2   2014-12-20 2015-12-20 5 1
 7: 2015-12-20          2   2014-12-25 2015-12-25 6 2  
 8: 2015-12-25          2   2014-12-25 2015-12-25 6 2 
 9: 2015-12-20          2   2015-02-10 2016-02-10 7 3 *
10: 2015-12-25          2   2015-02-10 2016-02-10 7 3 *
11: 2016-02-10          2   2015-02-10 2016-02-10 7 3 *

For patient 1, the groups are

  • .(start.date = 2012-10-15, end.date = 2013-10-15), count of 1
  • .(start.date = 2014-07-16, end.date = 2015-07-16), count of 1
  • .(start.date = 2015-01-07, end.date = 2016-01-07), count of 2

It is just by luck that there are both seven groups in this join and seven rows in the original table.

For the tougher issue, I'll borrow an example from my notes:

Beware multiple matches in an update join. When there are multiple matches, an update join will apparently only use the last one. Unfortunately, this is done silently. Try:

a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), 
  t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))
b[a, on=.(id), x := i.x, verbose = TRUE ][]

# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: x,i.x 
# Assigning to 3 row subset of 2 rows
#    id  y  x
# 1:  1 11 12
# 2:  2 15 13

With verbose on, we see a helpful message about assignment “to 3 row subset of 2 rows.”

-- modified from "Quick R Tutorial", section "Updating in a join"

In the OP's case, verbose=TRUE does not offer such a message, unfortunately.

DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
   n := .N, by = .EACHI, verbose=TRUE]
# Non-equi join operators detected ... 
#   forder took ... 0.01 secs
#   Generating group lengths ... done in 0 secs
#   Generating non-equi group ids ... done in 0 secs
#   Found 1 non-equi group(s) ...
# Starting bmerge ...done in 0.02 secs
# Detected that j uses these columns: <none> 
# lapply optimization is on, j unchanged as '.N'
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.000s for 7 groups
#   eval(j) took 0.000s for 7 calls
# 0.01 secs

However, we can see that the last row per x group does contain the value the OP sees. I've manually marked these with asterisks above. Alternately, you could mark them with z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date].


For reference, the update join here is...

DT[, n := 
  .SD[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date), .N, by=.EACHI]$N 
]

   patient.id hospitalization.date start.date   end.date     h.date n
1:          1           2013-10-15 2012-10-15 2013-10-15 2013-10-15 1
2:          1           2015-07-16 2014-07-16 2015-07-16 2015-07-16 1
3:          1           2016-01-07 2015-01-07 2016-01-07 2016-01-07 2
4:          2           2014-10-15 2013-10-15 2014-10-15 2014-10-15 1
5:          2           2015-12-20 2014-12-20 2015-12-20 2015-12-20 1
6:          2           2015-12-25 2014-12-25 2015-12-25 2015-12-25 2
7:          2           2016-02-10 2015-02-10 2016-02-10 2016-02-10 3

This is the correct/idiomatic way to handle this case, of adding columns to x based on looking up each row of x in another table and computing a summary of the result:

x[, v := DT2[.SD, on=, j, by=.EACHI]$V1 ]
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!