Merge dataframes on matching A, B and *closest* C?

走了就别回头了 2020-12-05 11:14

I have two dataframes like so:

set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=FALSE)

3 Answers
  • 2020-12-05 11:57

    Here is how to do it using merge a couple of times and aggregate once.

    set.seed(1)
    df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
    to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                           time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5],
                           stringsAsFactors = FALSE)
    
    #Find rows that match by x and y
    res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
    res$dif <- abs(res$time.x - res$time.y)
    res
    ##   x y time.x val time.y dif
    ## 1 2 1   17.0   a     11 6.0
    ## 2 2 1   12.0   b     11 1.0
    ## 3 2 1   11.6   c     11 0.6
    ## 4 2 4    2.0   e      6 4.0
    ## 5 3 5   22.5   d     23 0.5
    
    # For each (x, y), keep the candidate with the smallest time difference
    res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
    res1
    ##   x y dif time.x val time.y
    ## 1 2 1 0.6   11.6   c     11
    ## 2 2 4 4.0    2.0   e      6
    ## 3 3 5 0.5   22.5   d     23
    
    # Finally, merge the matches within 1 time unit back into df
    final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
    final
    ##    x y time  val
    ## 1  1 1    8 <NA>
    ## 2  1 2   27 <NA>
    ## 3  1 3   28 <NA>
    ## 4  1 4    2 <NA>
    ## 5  1 5   21 <NA>
    ## 6  2 1   11    c
    ## 7  2 2    6 <NA>
    ## 8  2 3   20 <NA>
    ## 9  2 4    6 <NA>
    ## 10 2 5   12 <NA>
    ## 11 3 1   17 <NA>
    ## 12 3 2   27 <NA>
    ## 13 3 3   19 <NA>
    ## 14 3 4    5 <NA>
    ## 15 3 5   23    d
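
    The same idea can be wrapped in a small helper. This is only a sketch: merge_closest and its tol argument are hypothetical names, not part of the answer above, and the column names x, y, time and val are assumed to match the example data. Unlike the code above, it applies the tolerance before picking the closest candidate, which gives the same result here.

    ```r
    # Hypothetical helper (an illustration, not from the original answer):
    # attach `val` from `src` to `target` where x and y match and the times
    # differ by at most `tol`, choosing the closest candidate per target row.
    merge_closest <- function(target, src, tol = 1) {
      # Pair every src row with the target rows that share the same x and y
      pairs <- merge(src, target, by = c("x", "y"),
                     suffixes = c(".src", ".target"))
      pairs$dif <- abs(pairs$time.src - pairs$time.target)
      # Drop pairs outside the tolerance
      pairs <- pairs[pairs$dif <= tol, ]
      # Sort so the smallest difference comes first, then keep the first
      # pair for each target row
      pairs <- pairs[order(pairs$x, pairs$y, pairs$time.target, pairs$dif), ]
      best <- pairs[!duplicated(pairs[c("x", "y", "time.target")]), ]
      names(best)[names(best) == "time.target"] <- "time"
      # Left join back onto target; rows without a match get NA in val
      merge(target, best[c("x", "y", "time", "val")], all.x = TRUE)
    }

    set.seed(1)
    df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
    to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                           time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5],
                           stringsAsFactors = FALSE)

    out <- merge_closest(df, to.merge)
    out[!is.na(out$val), ]
    ```

    On the example data only the rows (x = 2, y = 1) and (x = 3, y = 5) should receive a value, matching the final table above.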
    
  • 2020-12-05 11:58

    mnel's answer uses roll = "nearest" in a data.table join but does not limit the match to +/- 1, as the OP requested. In addition, MichaelChirico has suggested using the on parameter.

    This approach uses

    • roll = "nearest",
    • an update by reference, i.e., without copying,
    • setDT() to coerce a data.frame to data.table without copying (introduced 2014-02-27 with v.1.9.2 of data.table),
    • the on parameter, which makes it unnecessary to set a key explicitly (introduced 2015-09-19 with v.1.9.6).

    So, the code below

    library(data.table)   # version 1.11.4 used
    setDT(df)[setDT(to.merge), on  = .(x, y, time), roll = "nearest",
              val := replace(val, abs(x.time - i.time) > 1, NA)]
    df
    

    has updated df:

        x y time  val
     1: 1 1    8 <NA>
     2: 2 1   11    c
     3: 3 1   17 <NA>
     4: 1 2   27 <NA>
     5: 2 2    6 <NA>
     6: 3 2   27 <NA>
     7: 1 3   28 <NA>
     8: 2 3   20 <NA>
     9: 3 3   19 <NA>
    10: 1 4    2 <NA>
    11: 2 4    6 <NA>
    12: 3 4    5 <NA>
    13: 1 5   21 <NA>
    14: 2 5   12 <NA>
    15: 3 5   23    d
    

    Note that the order of rows has not been changed (in contrast to Chinmay Patil's answer).

    If df must not be changed, a new data.table can be created by

    result <- setDT(to.merge)[setDT(df), on  = .(x, y, time), roll = "nearest",
                    .(x, y, time, val = replace(val, abs(x.time - i.time) > 1, NA))]
    result
    

    which returns the same result as above.
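
    To see what x.time and i.time refer to, the candidate matches can be inspected before the +/- 1 filter is applied. This is a sketch that assumes the same example data. DF, TM, cand, matched_time and dist are names introduced here for illustration, and as.data.table() is used instead of setDT() so that df and to.merge are left untouched.

    ```r
    library(data.table)

    set.seed(1)
    df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
    to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                           time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5],
                           stringsAsFactors = FALSE)

    # as.data.table() copies, so the original data frames are not modified
    DF <- as.data.table(df)
    TM <- as.data.table(to.merge)

    # For each row of DF (the i table), pick the TM row (the x table) with the
    # same x and y whose time is nearest. i.time is DF's time, x.time is TM's.
    cand <- TM[DF, on = .(x, y, time), roll = "nearest",
               .(x, y, time = i.time, matched_time = x.time, val,
                 dist = abs(x.time - i.time))]

    # Candidates found by the rolling join, before any distance limit
    cand[!is.na(val)]
    ```

    On the example data this should list three candidates (c, e and d). The candidate e is 4 time units away from its df row, which is why the +/- 1 filter replaces it with NA.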

  • 2020-12-05 12:05

    Use data.table with roll = 'nearest' or, to limit the distance to 1, roll = 1 with rollends = c(TRUE, TRUE).

    For example:

    library(data.table)
    # create data.tables with the same key columns (x, y, time)
    DT <- data.table(df, key = names(df))
    tm <- data.table(to.merge, key = key(DT))
    
    # use join syntax with roll = 'nearest'
    tm[DT, roll = 'nearest']
    
    #     x y time val
    #  1: 1 1    8  NA
    #  2: 1 2   27  NA
    #  3: 1 3   28  NA
    #  4: 1 4    2  NA
    #  5: 1 5   21  NA
    #  6: 2 1   11   c
    #  7: 2 2    6  NA
    #  8: 2 3   20  NA
    #  9: 2 4    6   e
    # 10: 2 5   12  NA
    # 11: 3 1   17  NA
    # 12: 3 2   27  NA
    # 13: 3 3   19  NA
    # 14: 3 4    5  NA
    # 15: 3 5   23   d
    

    You can limit yourself to looking forward and back by 1 by setting roll = -1 and rollends = c(TRUE, TRUE):

    new <- tm[DT, roll=-1, rollends  =c(TRUE,TRUE)]
    new
        x y time val
     1: 1 1    8  NA
     2: 1 2   27  NA
     3: 1 3   28  NA
     4: 1 4    2  NA
     5: 1 5   21  NA
     6: 2 1   11   c
     7: 2 2    6  NA
     8: 2 3   20  NA
     9: 2 4    6  NA
    10: 2 5   12  NA
    11: 3 1   17  NA
    12: 3 2   27  NA
    13: 3 3   19  NA
    14: 3 4    5  NA
    15: 3 5   23   d
    

    Or you can use roll = 1 first, then roll = -1, and combine the results (tidying up the val.1 column from the second rolling join):

    new <- tm[DT, roll = 1][
      tm[DT, roll = -1]][
      is.na(val), val := ifelse(is.na(val.1), val, val.1)][
      , val.1 := NULL]
    new
        x y time val
     1: 1 1    8  NA
     2: 1 2   27  NA
     3: 1 3   28  NA
     4: 1 4    2  NA
     5: 1 5   21  NA
     6: 2 1   11   c
     7: 2 2    6  NA
     8: 2 3   20  NA
     9: 2 4    6  NA
    10: 2 5   12  NA
    11: 3 1   17  NA
    12: 3 2   27  NA
    13: 3 3   19  NA
    14: 3 4    5  NA
    15: 3 5   23   d
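
    The difference between the roll settings is easier to see on a tiny example. This is a sketch with made-up tables (obs and query are hypothetical names, unrelated to the data above): a positive roll carries the previous observation forward by at most that distance, a negative roll carries the next observation backward, and "nearest" picks whichever is closer with no distance limit.

    ```r
    library(data.table)

    # Two observations at t = 10 and t = 20
    obs <- data.table(t = c(10, 20), v = c("a", "b"))
    # Three lookup times between them
    query <- data.table(t = c(10.5, 14, 19.5))

    # roll = 1: carry the previous observation forward, by at most 1
    fwd <- obs[query, on = "t", roll = 1]$v

    # roll = -1: carry the next observation backward, by at most 1
    bwd <- obs[query, on = "t", roll = -1]$v

    # roll = "nearest": closest observation in either direction, no limit
    near <- obs[query, on = "t", roll = "nearest"]$v

    print(list(forward = fwd, backward = bwd, nearest = near))
    ```

    Here only 10.5 matches with roll = 1 and only 19.5 matches with roll = -1, while "nearest" matches all three lookups, including 14, which is 4 away from its closest observation. That is why a distance limit in both directions needs the two limited joins to be combined.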
    