Data Table merge based on date ranges

梦谈多话 2020-11-29 03:57

I have two tables, policies and claims

policies<-data.table(policyNumber=c(123,123,124,125), 
                EFDT=as.Date(c(\"2         


        
2 answers
  • 2020-11-29 04:01

    Version 1 (updated for data.table v1.9.4+)

    Try this:

    # Policies table; I've added policyNumber 126:
    policies<-data.table(policyNumber=c(123,123,124,125,126), 
                         EFDT=as.Date(c("2012-01-01","2013-01-01","2013-01-01","2013-02-01","2013-02-01")), 
                         EXDT=as.Date(c("2013-01-01","2014-01-01","2014-01-01","2014-02-01","2014-02-01")))
    
    # Claims table; I've added two claims for 126 that are before and after the policy dates:
    claims<-data.table(claimNumber=c(1,2,3,4,5,6), 
                       policyNumber=c(123,123,123,124,126,126),
                       lossDate=as.Date(c("2012-2-1","2012-8-15","2013-1-1","2013-10-31","2012-06-01","2014-03-01")),
                       claimAmount=c(10,20,20,15,5,25))
    
    # Set the keys for policies and claims so we can join them:
    setkey(policies,policyNumber,EFDT)
    setkey(claims,policyNumber,lossDate)
    
    # Join the tables using roll.
    # Earlier versions of data.table allowed the simpler call below, but it broke
    # when the by-without-by behavior was updated:
    # ans<-policies[claims,list(EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),roll=T][,EFDT:=NULL]
    ans<-policies[claims,list(.EFDT=EFDT,EXDT,claimNumber,lossDate,claimAmount,inPolicy=F),by=.EACHI,roll=T][,`:=`(EFDT=.EFDT, .EFDT=NULL)]
    
    # The claim should have inPolicy==T where lossDate is between EFDT and EXDT:
    ans[lossDate>=EFDT & lossDate<=EXDT, inPolicy:=T]
    
    # Set the keys again, but this time we'll join on both dates:
    setkey(ans,policyNumber,EFDT,EXDT)
    setkey(policies,policyNumber,EFDT,EXDT)
    
    # Union the ans table with policies that don't have any claims:
    ans<-rbindlist(list(ans, ans[policies][is.na(claimNumber)]))
    
    ans
    #   policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
    #1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
    #2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
    #3:          123 2013-01-01 2014-01-01           3 2013-01-01          20     TRUE
    #4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
    #5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
    #6:          126 2013-02-01 2014-02-01           6 2014-03-01          25    FALSE
    #7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA
    

    Version 2

    @Arun suggested using the new foverlaps function from data.table. My attempt below seems harder, not easier, so please let me know how to improve it.

    ## The foverlaps function requires both tables to have a start and end range, and the "y" table to be keyed
    claims[, lossDate2:=lossDate]  ## Add a redundant lossDate column to use as the end range for claims
    setkey(policies, policyNumber, EFDT, EXDT) ## Set the key for policies ("y" table)
    
    ## Find the overlaps, remove the redundant lossDate2 column, and add the inPolicy column:
    ans2 <- foverlaps(claims, policies, by.x=c("policyNumber", "lossDate", "lossDate2"))[, `:=`(inPolicy=T, lossDate2=NULL)]
    
    ## Update rows where the claim was out of policy:
    ans2[is.na(EFDT), inPolicy:=F]
    
    ## Remove duplicates (such as policyNumber==123 & claimNumber==3),
    ##   and add policies with no claims (policyNumber==125):
    setkey(ans2, policyNumber, claimNumber, lossDate, EFDT) ## order the results
    setkey(ans2, policyNumber, claimNumber) ## set the key to identify unique values
    ans2 <- rbindlist(list(
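      unique(ans2, by=key(ans2)), ## select only the unique values per key (by= is needed in data.table v1.9.8+, where unique() compares all columns by default)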
      policies[!.(ans2[, unique(policyNumber)])] ## policies with no claims
    ), fill=T)
    
    ans2
    ##    policyNumber       EFDT       EXDT claimNumber   lossDate claimAmount inPolicy
    ## 1:          123 2012-01-01 2013-01-01           1 2012-02-01          10     TRUE
    ## 2:          123 2012-01-01 2013-01-01           2 2012-08-15          20     TRUE
    ## 3:          123 2012-01-01 2013-01-01           3 2013-01-01          20     TRUE
    ## 4:          124 2013-01-01 2014-01-01           4 2013-10-31          15     TRUE
    ## 5:          126       <NA>       <NA>           5 2012-06-01           5    FALSE
    ## 6:          126       <NA>       <NA>           6 2014-03-01          25    FALSE
    ## 7:          125 2013-02-01 2014-02-01          NA       <NA>          NA       NA
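
The duplicate row for claim 3 that Version 2 has to remove comes from foverlaps() treating both interval ends as inclusive. A minimal sketch, using only the two policy terms of policyNumber 123 (the names pol, clm, dup and one are made up for this illustration), shows the double match and how moving the end date back by one day removes it:

```r
library(data.table)

# Two consecutive policy terms that share the boundary date 2013-01-01
pol <- data.table(policyNumber = 123,
                  EFDT = as.Date(c("2012-01-01", "2013-01-01")),
                  EXDT = as.Date(c("2013-01-01", "2014-01-01")))

# One claim whose loss date falls exactly on the boundary
clm <- data.table(policyNumber = 123, claimNumber = 3,
                  lossDate = as.Date("2013-01-01"))
clm[, lossDate2 := lossDate]  # one-day interval for the claim

# Closed intervals [EFDT, EXDT]: the claim overlaps both terms
setkey(pol, policyNumber, EFDT, EXDT)
dup <- foverlaps(clm, pol, by.x = c("policyNumber", "lossDate", "lossDate2"))
nrow(dup)

# Treat EXDT as exclusive by using EXDT - 1 day as the closed end
pol[, EXDTclosed := EXDT - 1L]
setkey(pol, policyNumber, EFDT, EXDTclosed)
one <- foverlaps(clm, pol, by.x = c("policyNumber", "lossDate", "lossDate2"))
nrow(one)
```

With the shifted end date the claim matches only the term that starts on 2013-01-01, which agrees with the Version 1 result.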
    

    Version 3

    Using foverlaps(), another version:

    require(data.table) ## 1.9.4+
    setDT(claims)[, lossDate2 := lossDate]
    setDT(policies)[, EXDTclosed := EXDT-1L]
    setkey(claims, policyNumber, lossDate, lossDate2)
    foverlaps(policies, claims, by.x=c("policyNumber", "EFDT", "EXDTclosed"))
    

    foverlaps() requires a start and an end column for the intervals in both tables. A claim has only a single date, so we copy lossDate into lossDate2, which turns each claim into a one-day interval.

    foverlaps() treats intervals as closed at both ends, whereas EXDT is meant as an exclusive end date (in the sample data it is also the start of the next policy term). We therefore subtract one day from it and store the result in a new column, EXDTclosed.

    Next, we set the key. foverlaps() requires the last two key columns to be the interval columns, so they are listed last. We also want the overlap join to match on policyNumber first, so it goes at the front of the key.

    The key must be set on claims, the second ("y") argument (check ?foverlaps). Setting a key on policies is optional; if you do set it, you can skip the by.x argument, which then defaults to that key. Since policies is not keyed here, we name the corresponding columns explicitly in by.x. The default overlap type, "any", is what we want, so it is not specified. This results in:

    #    policyNumber claimNumber   lossDate claimAmount  lossDate2       EFDT       EXDT EXDTclosed
    # 1:          123           1 2012-02-01          10 2012-02-01 2012-01-01 2013-01-01 2012-12-31
    # 2:          123           2 2012-08-15          20 2012-08-15 2012-01-01 2013-01-01 2012-12-31
    # 3:          123           3 2013-01-01          20 2013-01-01 2013-01-01 2014-01-01 2013-12-31
    # 4:          124           4 2013-10-31          15 2013-10-31 2013-01-01 2014-01-01 2013-12-31
    # 5:          125          NA       <NA>          NA       <NA> 2013-02-01 2014-02-01 2014-01-31
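
As an alternative to foverlaps(), data.table v1.9.8+ supports non-equi joins through the on= argument, which can express the half-open interval EFDT <= lossDate < EXDT directly, without the helper columns. The following is only a sketch under that assumption and not part of the original answer. It uses the sample data from Version 1, the result name res is arbitrary, and it returns one row per claim rather than one row per policy:

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125, 126),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01", "2014-02-01"))
)
claims <- data.table(
  claimNumber = 1:6,
  policyNumber = c(123, 123, 123, 124, 126, 126),
  lossDate = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01",
                       "2013-10-31", "2012-06-01", "2014-03-01")),
  claimAmount = c(10, 20, 20, 15, 5, 25)
)

# For each claim, look up the policy term with EFDT <= lossDate < EXDT.
# The x. and i. prefixes pick the policies and claims columns explicitly,
# because the join columns would otherwise show the values from claims.
res <- policies[claims,
  on = .(policyNumber, EFDT <= lossDate, EXDT > lossDate),
  .(policyNumber = i.policyNumber, claimNumber, lossDate = i.lossDate,
    claimAmount, EFDT = x.EFDT, EXDT = x.EXDT)]

# Claims without a matching term have NA policy dates
res[, inPolicy := !is.na(EFDT)]
res
```

Policies without claims (policyNumber 125 here) are not part of this result and would still have to be appended, as in Version 1.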
    
  • 2020-11-29 04:03

    I think this does mostly what you want. I have to run, so I don't have time to add the policy with no claims or to clean up the columns, but I think the difficult issues are addressed:

    setkey(policies, policyNumber, EXDT)
    policies[, EXDT2:=EXDT]
    policies[claims[, list( policyNumber, lossDate, lossDate, claimNumber, claimAmount)], roll=-Inf]
    #    policyNumber       EXDT       EFDT      EXDT2   lossDate claimNumber claimAmount
    # 1:          123 2012-02-01 2012-01-01 2013-01-01 2012-02-01           1          10
    # 2:          123 2012-08-15 2012-01-01 2013-01-01 2012-08-15           2          20
    # 3:          123 2013-01-01 2012-01-01 2013-01-01 2013-01-01           3          20
    # 4:          124 2013-10-31 2013-01-01 2014-01-01 2013-10-31           4          15
    

    Also, note that it is straightforward to remove or flag claims that fall outside the policy dates in this result.
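
One way the flagging could look is sketched below. It is an assumption-laden variant of the join above rather than the answerer's own code: it uses on= instead of a key, the extended sample data from the first answer, and a hypothetical helper column joinDate so that lossDate survives the join. Like this answer, it treats EXDT as an inclusive end date:

```r
library(data.table)

policies <- data.table(
  policyNumber = c(123, 123, 124, 125, 126),
  EFDT = as.Date(c("2012-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01")),
  EXDT = as.Date(c("2013-01-01", "2014-01-01", "2014-01-01", "2014-02-01", "2014-02-01"))
)
claims <- data.table(
  claimNumber = 1:6,
  policyNumber = c(123, 123, 123, 124, 126, 126),
  lossDate = as.Date(c("2012-02-01", "2012-08-15", "2013-01-01",
                       "2013-10-31", "2012-06-01", "2014-03-01")),
  claimAmount = c(10, 20, 20, 15, 5, 25)
)

policies[, EXDT2 := EXDT]       # copy: EXDT is replaced by the claim date in the join
claims[, joinDate := lossDate]  # copy: keeps lossDate as a regular column

# Roll each claim forward to the next expiry date of the same policy number
res <- policies[claims, on = .(policyNumber, EXDT = joinDate), roll = -Inf]

# A claim is in policy if a term was found and the loss date lies within it
res[, inPolicy := !is.na(EFDT) & lossDate >= EFDT & lossDate <= EXDT2]

inside <- res[inPolicy == TRUE]  # keep only the claims covered by a policy
```

Rows with inPolicy == FALSE are the claims before the policy start (claim 5) or after the last expiry date (claim 6).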
