How to find dates that overlap between two different dataframes and subset

谎友^ 2021-01-15 20:25

I would like to use a date from dataframe A to select the rows in dataframe B that have a matching ID and a date within 180 days of it.

eg.

         


        
1 Answer
  •  鱼传尺愫
    2021-01-15 20:54

    If you have large data, I would suggest using data.table's rolling join instead.

    Assuming these are your data sets

    dfa <- read.table(text = "ID  Date
                      42  '2012-07-21'
                      42  '2013-04-12'", header = TRUE)
    
    dfb <- read.table(text = "ID Date
                      12 '2016-09-08'
                      35 '2008-02-02'
                      42 '2012-01-09'
                      42 '2013-03-13'", header = TRUE)
    

    We will convert them to data.tables and convert the Date column to IDate class

    library(data.table) #1.9.8+
    setDT(dfa)[, Date := as.IDate(Date)]
    setDT(dfb)[, Date := as.IDate(Date)]
    

    Then simply join. A rolling join looks in one direction only, so run it once with `roll = 180` and once with `roll = -180` to cover both sides of the date.

    # This matches dfb dates up to 180 days before each dfa date;
    # repeat with `roll = -180` to match dfb dates up to 180 days after
    indx <- dfb[
                dfa, # For each row in dfa, find a match in dfb
                on = .(ID, Date), # The columns to join by
                roll = 180, # Rolling window of at most 180 days
                which = TRUE, # Return the row indices of `dfb` that were matched
                mult = "first", # If several rows match, take only the first
                nomatch = 0L # Don't return unmatched indices (NAs)
               ]
    
    dfb[indx]
    #    ID       Date
    # 1: 42 2013-03-13
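
The two-direction version mentioned above could look roughly like the sketch below. The names `fwd` and `bwd` are illustrative and not part of the original answer, and the data is rebuilt inline so the snippet stands alone. Note that each rolling join returns at most the single nearest `dfb` row per `dfa` row, not every row inside the window.

```r
library(data.table)

# Same sample data as above, built directly as data.tables
dfa <- data.table(ID = c(42L, 42L),
                  Date = as.IDate(c("2012-07-21", "2013-04-12")))
dfb <- data.table(ID = c(12L, 35L, 42L, 42L),
                  Date = as.IDate(c("2016-09-08", "2008-02-02",
                                    "2012-01-09", "2013-03-13")))

# roll = 180: nearest dfb date at most 180 days BEFORE each dfa date
fwd <- dfb[dfa, on = .(ID, Date), roll = 180,
           which = TRUE, nomatch = 0L]

# roll = -180: nearest dfb date at most 180 days AFTER each dfa date
bwd <- dfb[dfa, on = .(ID, Date), roll = -180,
           which = TRUE, nomatch = 0L]

# Combine both directions and drop duplicate row indices
indx <- sort(unique(c(fwd, bwd)))
dfb[indx]
```

With this sample only the forward roll finds a match, so the combined result is the same single row as before.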
    

    An alternative is to use data.table's non-equi join feature on manually created Date ± 180 columns:

    # Create range columns
    dfa[, c("Date_m_180", "Date_p_180") := .(Date - 180L, Date + 180L)]
    
    # Join away
    indx <- dfb[dfa, 
                on = .(ID, Date >= Date_m_180, Date <= Date_p_180), 
                which = TRUE, 
                mult = "first",
                nomatch = 0L]
    dfb[indx]
    #    ID       Date
    # 1: 42 2013-03-13
    

    Both methods should handle large data sets very quickly.
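
To sanity-check either join on a small sample, a plain base R comparison can help. This is only a sketch and not part of the original answer: it pairs every `dfb` row with every `dfa` row of the same ID, so it is meant for verification on small inputs rather than for large data. The helper column `row` and the names `pairs`, `in_window` and `keep` are illustrative.

```r
# Same sample data as above, as plain data frames
dfa <- data.frame(ID = c(42, 42),
                  Date = as.Date(c("2012-07-21", "2013-04-12")))
dfb <- data.frame(ID = c(12, 35, 42, 42),
                  Date = as.Date(c("2016-09-08", "2008-02-02",
                                   "2012-01-09", "2013-03-13")))

# Remember the original row position in dfb
dfb$row <- seq_len(nrow(dfb))

# All ID-matched pairs; Date columns become Date_b (dfb) and Date_a (dfa)
pairs <- merge(dfb, dfa, by = "ID", suffixes = c("_b", "_a"))

# Keep pairs whose dates are at most 180 days apart in either direction
in_window <- abs(as.numeric(pairs$Date_b - pairs$Date_a)) <= 180

# Unique dfb rows that fall inside the window of at least one dfa date
keep <- sort(unique(pairs$row[in_window]))
dfb[keep, c("ID", "Date")]
```

Unlike the joins with `mult = "first"`, this keeps every `dfb` row that falls inside the window; on the sample data that is again the single 2013-03-13 row.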
