R data.table set new column with logical value if a weekday is between a date range

前端 未结 3 1080
萌比男神i
萌比男神i 2021-01-15 09:54

I have a data.table object with two date columns, from and to. I want to create a new column to determine if a specific w

3条回答
  •  轻奢々
    轻奢々 (楼主)
    2021-01-15 10:17

    Here's one approach:

    next_wday <- function(d,wd=4L){
        wddiff = wd - wday(d)
        d + wddiff + (wddiff < 0L)*7L
    } 
    
    
    DT[, flag2 := +(next_wday(from) <= to)]
    
    # test:
    DT[,table(flag,flag2)]
    #     flag2
    # flag  0  1
    #    0 44  0
    #    1  0 57
    

    The idea is that you compare to against the next Thursday**. The replacement line could be written a number of different ways.

    Benchmark

    The OP mentioned that from and to could be up to 200 days apart so...

    set.seed(1)
    from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
    to   <- from + pmin(200,rpois(length(from),1))
    DT   <- data.table(from,to)
    
    system.time(DT[, flag2 := +(next_wday(from) <= to)])
    #    user  system elapsed 
    #    2.11    0.03    2.14
    
    # David Arenburg's solution
    system.time({
        DateDT <- DT[, {
                    temp <- seq(min(from), max(to), by = "day")
                    temp2 <- temp[wday(temp) == 4L]
                    list(from = temp2, to = temp2)
                   }
                 ]
        indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
        DT[, flag := 0L][indx, flag := 1L]
    })  
    #    user  system elapsed 
    #    6.75    0.14    6.89
    
    # check agreement
    DT[,table(flag,flag2)]
    #     flag2
    # flag      0      1
    #    0 714666      0
    #    1      0 285334
    

    I'm using IDate because it is the date format that comes with the data.table package and is (?) faster to work with. There are a couple of ways one could make the code even faster:

    • First, it might be faster to restrict attention to rows where to-from is less than 6 (since any gap 6 or greater will have every weekday), like

      DT[,flag2:=0L][to-from < 6, flag2 := +(next_wday(from) <= to)]
      
    • Second, because the computation only depends on one row at a time, parallelization may lead to some improvement, as illustrated in @grubjesic's answer.

    • Depending on the data on one's real data, additional improvements might be found.

    The OP's code isn't benchmarked here because it entails splitting the data by rows and enumerating up to 200 dates per row, which will certainly be slow.


    ** or whatever wday being 4 means.

提交回复
热议问题