Odd behaviour of data.table's update on non-equi self-join

醉梦人生 2021-02-09 15:38

While preparing an answer to the question "dplyr or data.table to calculate time series aggregations in R", I noticed that I get different results depending on whether the table

1 answer
    南旧 (thread starter)
    2021-02-09 16:18

    The grouping by=.EACHI means "by each row of i", not "by each row of x".
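
    A minimal sketch of that point, using two hypothetical toy tables (x and i here are made up for illustration and are not the OP's data). The result has one row per row of i, in the order of i, and .N counts the rows of x matched by that row of i:

```r
library(data.table)

# Hypothetical toy tables, not the OP's data
x = data.table(id = c(1L, 1L, 2L), val = c(10, 20, 30))
i = data.table(id = c(2L, 1L))

# j is evaluated once per row of i; .N is the number of x rows that row matches
res = x[i, on = .(id), .N, by = .EACHI]
print(res)
```

    Although x has three rows, res has only two, one for each row of i.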

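    # for readability / my sanity
    DT = copy(DT0)
    setnames(DT, "hospitalization.date", "h.date")
    
    # Non-equi self-join: j is evaluated once per row of i (by = .EACHI).
    # As the output below shows, the chained tail.default() call keeps only
    # the six j columns and drops the join columns.
    z = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
       .(x.h.date, patient.id, i.start.date, i.end.date, g = .GRP, .N)
    , by=.EACHI][, utils:::tail.default(.SD, 6)]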
    
          x.h.date patient.id i.start.date i.end.date g N
     1: 2013-10-15          1   2012-10-15 2013-10-15 1 1 * 
     2: 2015-07-16          1   2014-07-16 2015-07-16 2 1 
     3: 2015-07-16          1   2015-01-07 2016-01-07 3 2 *
     4: 2016-01-07          1   2015-01-07 2016-01-07 3 2 *
     5: 2014-10-15          2   2013-10-15 2014-10-15 4 1 *  
     6: 2015-12-20          2   2014-12-20 2015-12-20 5 1
     7: 2015-12-20          2   2014-12-25 2015-12-25 6 2  
     8: 2015-12-25          2   2014-12-25 2015-12-25 6 2 
     9: 2015-12-20          2   2015-02-10 2016-02-10 7 3 *
    10: 2015-12-25          2   2015-02-10 2016-02-10 7 3 *
    11: 2016-02-10          2   2015-02-10 2016-02-10 7 3 *
    

    For patient 1, the groups are

    • .(start.date = 2012-10-15, end.date = 2013-10-15), count of 1
    • .(start.date = 2014-07-16, end.date = 2015-07-16), count of 1
    • .(start.date = 2015-01-07, end.date = 2016-01-07), count of 2

    It is just by luck that there are both seven groups in this join and seven rows in the original table.

    For the tougher issue, I'll borrow an example from my notes:

    Beware multiple matches in an update join. When there are multiple matches, an update join will apparently only use the last one. Unfortunately, this is done silently. Try:

    a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), 
      t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
    b = data.table(id = 1:2, y = c(11L, 15L))
    b[a, on=.(id), x := i.x, verbose = TRUE ][]
    
    # Calculated ad hoc index in 0 secs
    # Starting bmerge ...done in 0.02 secs
    # Detected that j uses these columns: x,i.x 
    # Assigning to 3 row subset of 2 rows
    #    id  y  x
    # 1:  1 11 12
    # 2:  2 15 13
    

    With verbose on, we see a helpful message about assignment “to 3 row subset of 2 rows.”

    -- modified from "Quick R Tutorial", section "Updating in a join"
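
    One way to avoid relying on that silent behaviour is to count the matches first and then choose the row to use explicitly. The sketch below reuses a and b from the example above; picking the last row per id with .SD[.N] is only an assumption made here to mirror the result shown, not a recommendation from the notes:

```r
library(data.table)

a = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
  t = c(1L, 2L, 1L, 2L, NA_integer_), x = 11:15)
b = data.table(id = 1:2, y = c(11L, 15L))

# For each row of b, count how many rows of a match it
matches = a[b, on = .(id), .N, by = .EACHI]
dupes = matches[N > 1L]   # rows of b that would receive more than one value
print(dupes)

# Make the choice explicit: keep the last row of a per id, then update b
a_last = a[, .SD[.N], by = id]
b[a_last, on = .(id), x := i.x]
print(b)
```

    Here dupes flags id 1, the only row of b with more than one match in a.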

    In the OP's case, verbose=TRUE does not offer such a message, unfortunately.

    DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date), 
       n := .N, by = .EACHI, verbose=TRUE]
    # Non-equi join operators detected ... 
    #   forder took ... 0.01 secs
    #   Generating group lengths ... done in 0 secs
    #   Generating non-equi group ids ... done in 0 secs
    #   Found 1 non-equi group(s) ...
    # Starting bmerge ...done in 0.02 secs
    # Detected that j uses these columns:  
    # lapply optimization is on, j unchanged as '.N'
    # Making each group and running j (GForce FALSE) ... 
    #   memcpy contiguous groups took 0.000s for 7 groups
    #   eval(j) took 0.000s for 7 calls
    # 0.01 secs
    

    However, we can see that the last row per x group does contain the value the OP sees. I've manually marked these with asterisks above. Alternatively, you could mark them with z[, mrk := replace(rep(0, .N), .N, 1), by=x.h.date].
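
    To see what that marking expression does, here is a small sketch on the first four rows of z (patient 1), retyped by hand from the printed result above with only two of its columns kept:

```r
library(data.table)

# First four rows (patient 1) of the printed join result, retyped by hand
z = data.table(
  x.h.date = as.Date(c("2013-10-15", "2015-07-16", "2015-07-16", "2016-01-07")),
  N = c(1L, 1L, 2L, 2L)
)

# Within each x.h.date group: a vector of zeros with a 1 in the last position
z[, mrk := replace(rep(0, .N), .N, 1), by = x.h.date]

# N on the marked rows, i.e. the last match for each row of x
last_n = z[mrk == 1, N]
print(z)
```

    The marked rows are the same ones carrying asterisks for patient 1 in the printed result.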


    For reference, the update join here is...

    DT[, n := 
      .SD[.SD, on = .(patient.id, h.date >= start.date, h.date <= end.date), .N, by=.EACHI]$N 
    ]
    
       patient.id hospitalization.date start.date   end.date     h.date n
    1:          1           2013-10-15 2012-10-15 2013-10-15 2013-10-15 1
    2:          1           2015-07-16 2014-07-16 2015-07-16 2015-07-16 1
    3:          1           2016-01-07 2015-01-07 2016-01-07 2016-01-07 2
    4:          2           2014-10-15 2013-10-15 2014-10-15 2014-10-15 1
    5:          2           2015-12-20 2014-12-20 2015-12-20 2015-12-20 1
    6:          2           2015-12-25 2014-12-25 2015-12-25 2015-12-25 2
    7:          2           2016-02-10 2015-02-10 2016-02-10 2016-02-10 3
    

    This is the correct, idiomatic way to handle this case, namely adding columns to x based on looking up each row of x in another table and computing a summary of the result:

    x[, v := DT2[.SD, on=, j, by=.EACHI]$V1 ]
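
    As a self-contained sketch of this pattern, the table below is rebuilt by hand from the printed output above (the OP's DT0 itself is not shown in this answer, so the construction is an assumption). The counts are computed with a by=.EACHI join, which returns one value per row of i in the order of i, and are then assigned as a plain vector:

```r
library(data.table)

# Rebuilt by hand from the printed output above; DT0 itself is not shown here
DT = data.table(
  patient.id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  h.date = as.Date(c("2013-10-15", "2015-07-16", "2016-01-07", "2014-10-15",
                     "2015-12-20", "2015-12-25", "2016-02-10")),
  start.date = as.Date(c("2012-10-15", "2014-07-16", "2015-01-07", "2013-10-15",
                         "2014-12-20", "2014-12-25", "2015-02-10"))
)
DT[, end.date := h.date]

# One count per row of i: hospitalizations of the same patient in [start.date, end.date]
counts = DT[DT, on = .(patient.id, h.date >= start.date, h.date <= end.date),
            .N, by = .EACHI]$N

# Assign the vector directly instead of updating inside the join
DT[, n := counts]
print(DT)
```

    Because the assignment happens outside the join, each row of DT receives exactly one value, however many rows of x it matched.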
    
