I have a data.table
object with two date
columns, from
and to
. I want to create a new column to determine if a specific w
Here's one approach:
next_wday <- function(d,wd=4L){
wddiff = wd - wday(d)
d + wddiff + (wddiff < 0L)*7L
}
DT[, flag2 := +(next_wday(from) <= to)]
# test:
DT[,table(flag,flag2)]
# flag2
# flag 0 1
# 0 44 0
# 1 0 57
The idea is that you compare to
against the next Thursday**. The replacement line could be written a number of different ways.
Benchmark
The OP mentioned that from
and to
could be up to 200 days apart so...
set.seed(1)
from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
to <- from + pmin(200,rpois(length(from),1))
DT <- data.table(from,to)
system.time(DT[, flag2 := +(next_wday(from) <= to)])
# user system elapsed
# 2.11 0.03 2.14
# David Arenburg's solution
system.time({
DateDT <- DT[, {
temp <- seq(min(from), max(to), by = "day")
temp2 <- temp[wday(temp) == 4L]
list(from = temp2, to = temp2)
}
]
indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
DT[, flag := 0L][indx, flag := 1L]
})
# user system elapsed
# 6.75 0.14 6.89
# check agreement
DT[,table(flag,flag2)]
# flag2
# flag 0 1
# 0 714666 0
# 1 0 285334
I'm using IDate
because it is the date format that comes with the data.table package and is (?) faster to work with. There are a couple of ways one could make the code even faster:
First, it might be faster to restrict attention to rows where to-from
is less than 6 (since any gap 6 or greater will have every weekday), like
DT[,flag2:=0L][to-from < 6, flag2 := +(next_wday(from) <= to)]
Second, because the computation only depends on one row at a time, parallelization may lead to some improvement, as illustrated in @grubjesic's answer.
Depending on the data on one's real data, additional improvements might be found.
The OP's code isn't benchmarked here because it entails splitting the data by rows and enumerating up to 200 dates per row, which will certainly be slow.
** or whatever wday
being 4 means.
You could also try the foverlaps
approach
First will create data set of all the Wednesday starting from min(from)
and ending at max(to)
DateDT <- DT[, {
temp <- seq(min(from), max(to), by = "day")
temp2 <- temp[wday(temp) == 4L]
.(from = temp2, to = temp2)
}
]
Then run foverlaps
and extract desired rows
indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
Then a simple update by reference will do
DT[, flag := 0L][indx, flag := 1L]
DT[, table(flag)]
# 0 1
# 44 57
Here's my example:
library(parallel)
process <- function(){
from <- seq(as.Date("1950-01-01"), by = "day", length = 100000)
to <- seq(as.Date("1950-01-04"), by = "day", length = 100000)
DT <- data.frame(from,to)
Ncores <- detectCores()
flagList <- mclapply(1:nrow(DT),function(id){
4 %in% strftime(seq(as.Date(DT[id,1]), as.Date(DT[id,2]), by="day"), format="%w")
},mc.cores=Ncores)
flag <- unlist(flagList)
return(cbind(DT,flag))
}
It takes just 15 sec for 100k rows on my i7 processor. Hope this helps.