Join dataframes by id and overlapping date range

暗喜 2021-01-15 15:34

I have two dataframes x and y that contain columns for ids and for dates.

id.x <- c(1, 2, 4, 5, 7, 8, 10)
date.x <- as.Date(c("2015-01-01", "201


        
3 Answers
  • 2021-01-15 16:10

    Using the development version of data.table, v1.9.7, where non-equi (or conditional) joins were recently implemented, we can do this in a straightforward (and efficient) manner. See installation instructions here.

    require(data.table) # v1.9.7+
    setDT(x)
    setDT(y) ## convert both data.frames to data.tables by reference
    
    x[, date.x.plus3 := date.x + 3L]
    y[x, .(id.x, date.x, date.y=x.date.y), 
         on=.(id.y == id.x, date.y >= date.x, date.y <= date.x.plus3)]
    #    id.x     date.x     date.y
    # 1:    1 2015-01-01 2015-01-03
    # 2:    2 2015-01-02       <NA>
    # 3:    4 2015-01-21       <NA>
    # 4:    5 2015-01-13       <NA>
    # 5:    7 2015-01-29 2015-01-29
    # 6:    8 2015-01-01       <NA>
    # 7:   10 2015-01-03       <NA>
    

    Solutions that join on a dummy column and then filter on the conditions generally do not scale, because the number of intermediate rows quickly explodes. Solutions that loop through the rows and evaluate the filtering condition for each one are slow, because they perform the operation row-wise.

    This solution does neither, i.e., performs the conditional join directly, and therefore should be performant both in terms of runtime and memory.
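
    For contrast, the join-then-filter pattern mentioned above can be sketched in base R. This is only an illustration and not part of the data.table approach: the sample data is copied from another answer in this thread, the names `joined` and `in_window` are made up for the example, and the sketch assumes each id appears at most once in y (with repeated ids the intermediate result grows and the non-matching duplicates would still need collapsing).

    ```r
    # Sample data, copied from another answer in this thread
    x <- data.frame(
      id.x   = c(1, 2, 4, 5, 7, 8, 10),
      date.x = as.Date(c("2015-01-01", "2015-01-02", "2015-01-21", "2015-01-13",
                         "2015-01-29", "2015-01-01", "2015-01-03"))
    )
    y <- data.frame(
      id.y   = c(1, 2, 3, 6, 7, 8, 9),
      date.y = as.Date(c("2015-01-03", "2015-01-29", "2015-01-22", "2015-01-13",
                         "2015-01-29", "2014-12-31", "2015-01-03"))
    )

    # Step 1: equi-join on id only, keeping every row of x (left join)
    joined <- merge(x, y, by.x = "id.x", by.y = "id.y", all.x = TRUE)

    # Step 2: evaluate the date window on the joined rows
    in_window <- !is.na(joined$date.y) &
      joined$date.y >= joined$date.x &
      joined$date.y <= joined$date.x + 3

    # Rows outside the window keep their x columns but lose the y date
    joined$date.y[!in_window] <- NA
    joined
    ```

    On this small example only ids 1 and 7 keep a date.y. The intermediate `joined` table is where the cost shows up on larger inputs, which is the step a non-equi join avoids.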

  • 2021-01-15 16:13

    Do an inner join of the y and x data.tables by setting the id column as the key of each, then check the date conditions and keep only the rows where they hold.

    library("data.table")
    
    x <- as.data.table(x)
    y <- as.data.table(y)
    
    setkey(x, id.x)
    setkey(y, id.y)
    
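    # Inner join on id, flag the rows inside the 3-day window (inclusive), then keep only those rows
    z <- y[x, nomatch = 0][, j = .(is_true = ((date.y <= date.x + 3) & (date.y >= date.x)), id.y, date.x, date.y)][i = is_true == TRUE]
    
    > z
       is_true id.y     date.x     date.y
    1:    TRUE    1 2015-01-01 2015-01-03
    2:    TRUE    7 2015-01-29 2015-01-29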
    
  • 2021-01-15 16:16

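    You can use ifelse to build a vector that equals date.x where date.y <= date.x + 3 and date.y >= date.x, and equals date.y otherwise, then merge the two data frames on that vector. Note that the comparison is element-wise, so this relies on rows of x and y with the same id lining up by position, as they do in this example: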

    id.x <- c(1, 2, 4, 5, 7, 8, 10)
    date.x <- as.Date(c("2015-01-01", "2015-01-02", "2015-01-21", "2015-01-13", "2015-01-29", "2015-01-01", "2015-01-03"),format = "%Y-%m-%d")
    x <- cbind.data.frame(id.x, date.x)
    id.y <- c(1, 2, 3, 6, 7, 8, 9)
    date.y <- as.Date(c("2015-01-03", "2015-01-29", "2015-01-22", "2015-01-13", "2015-01-29", "2014-12-31", "2015-01-03"), format = "%Y-%m-%d")
    y <- cbind.data.frame(id.y, date.y)
    
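    # ifelse() drops the Date class, so restore it from the `yes` argument
    safe.ifelse <- function(cond, yes, no) structure(ifelse(cond, yes, no), class = class(yes))
    
    # date.x where date.y falls within [date.x, date.x + 3], otherwise date.y
    match_date <- safe.ifelse(date.y <= date.x + 3 & date.y >= date.x,
                              date.x,
                              date.y)
    
    y$date.x <- match_date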
    names(y)[1] <- "id.x"
    
    dplyr::left_join(x, y, by=c("id.x","date.x"))
    
      id.x     date.x     date.y
    1    1 2015-01-01 2015-01-03
    2    2 2015-01-02       <NA>
    3    4 2015-01-21       <NA>
    4    5 2015-01-13       <NA>
    5    7 2015-01-29 2015-01-29
    6    8 2015-01-01       <NA>
    7   10 2015-01-03       <NA>
    

    I borrowed the safe.ifelse function from this post because base ifelse strips the Date class and returns a numeric vector rather than a date vector.
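
    A minimal sketch of that behaviour, using two made-up dates (`d1`, `d2`, `plain`, and `restored` are names chosen for this example only):

    ```r
    d1 <- as.Date("2015-01-01")
    d2 <- as.Date("2015-01-03")

    # Base ifelse() keeps the underlying day count but drops the class attribute
    plain <- ifelse(TRUE, d1, d2)
    class(plain)     # "numeric"

    # Re-attaching the class of the `yes` argument, as safe.ifelse does, restores the Date
    restored <- structure(plain, class = class(d1))
    class(restored)  # "Date"
    ```

    The restored value compares equal to `d1`, so it can be used as a join key alongside other Date columns.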
