I am new to R and this is my first question on stackoverflow.
I am trying
Below I show 5 working data.table solutions that were candidates to performance test against the OP's actual data set (1.4M records).
All 5 solutions use "non-equi" joins (using inequality to compare columns for the join) in the on clause.
Each solution is just a small progressive code change, so it should be easy to follow along and compare the different data.table options and syntax choices.
Solution 1: to work through the data.table syntax, I broke the OP's problem into the following steps:
# Add row numbers to all records in dt (only because you
# have criteria based on comparing sequential rows):
dt[, row := .I]

# Compute result columns (then standard assignment into dt using <-)
dt$found_date <-
  dt[code=='p'][dt,    # join dt to the data.table matching your criteria, in this case dt[code=='p']
    .( x.date_up ),    # columns to select, x. prefix means columns from dt[code=='p']
    on = .(id==id, row > row, date_up > date_down),    # join criteria: dt[code=='p'] fields on LHS, main dt fields on RHS
    mult = "first"]    # get only the first match if multiple matches
Note in the join expressions above:
- i in this case is your main dt. This way you get all records from your main data.table.
- x is the subset (or any other data.table) from which you want to find matching values.
Result matches requested output:
dt
id code date_down date_up row found_date
1: 1 p 2019-01-01 2019-01-02 1 <NA>
2: 1 f 2019-01-02 2019-01-03 2 <NA>
3: 2 f 2019-01-02 2019-01-02 3 <NA>
4: 2 p 2019-01-03 <NA> 4 <NA>
5: 3 p 2019-01-04 <NA> 5 <NA>
6: 4 <NA> 2019-01-05 2019-01-05 6 <NA>
7: 5 f 2019-01-07 2019-01-08 7 2019-01-08
8: 5 p 2019-01-07 2019-01-08 8 2019-01-09
9: 5 p 2019-01-09 2019-01-09 9 <NA>
10: 6 f 2019-01-10 2019-01-10 10 2019-01-11
11: 6 p 2019-01-10 2019-01-10 11 2019-01-11
12: 6 p 2019-01-10 2019-01-11 12 <NA>
Note: You may remove the row column by doing dt[, row := NULL] if you like.
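To see the x. and i. roles in isolation, here is a minimal sketch on toy data. The tables prices and queries and their columns are made up for illustration and are not part of the OP's data. Each row of queries is matched to the first row of prices with the same id and a strictly larger value; rows without a match come back as NA.

```r
library(data.table)

# Hypothetical toy tables, not the OP's data
prices  <- data.table(id = c(1L, 1L, 2L), val = c(10, 20, 30))
queries <- data.table(id = c(1L, 1L, 2L, 3L), lim = c(5, 15, 30, 0))

# x is `prices` (outside the brackets), i is `queries` (inside the brackets).
# In on=, the left side of each condition is a column of x,
# the right side is a column of i.
res <- prices[queries,
              .(id = i.id, lim = i.lim, first_val = x.val),
              on = .(id, val > lim),
              mult = "first"]   # keep only the first matching row of x
print(res)
```

Because i drives the join, the result has one row per row of queries, which is the same reason the solutions above return one value per row of dt.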
Solution 2: identical logic as above to join and find the result columns, but now using "assign by reference" := to create found_date in dt:
dt[, row := .I]    # add row numbers (as in all the solutions)

# Compute result columns (then assign by reference into dt using :=)
# dt$found_date <-
dt[, found_date :=    # assign by reference to dt$found_date
  dt[code=='p'][dt,
    .( x.date_up ),
    on = .(id==id, row > row, date_up > date_down),
    mult = "first"]]
In Solution 2, the slight variation of assigning our results "by reference" into dt should be more efficient than Solution 1. Solution 1 calculated the results in exactly the same way; the only difference is that Solution 1 used standard assignment <- to create dt$found_date (less efficient).
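As a small illustration of what "by reference" means, the sketch below uses made-up tables a, b and d (not the OP's data). It shows that := changes the existing object in place, so a second name bound to the same data.table sees the new column, while an explicit copy() does not.

```r
library(data.table)

a <- data.table(v = 1:3)

b <- a              # no copy is made: a and b are the same data.table
b[, w := v * 2L]    # := adds column w in place
print(names(a))     # a now also has column w

d <- copy(a)        # explicit copy: d is independent of a
d[, z := 0L]
print(names(a))     # a does not get column z
```

This in-place update is what lets := avoid the extra copying that standard assignment with <- can trigger.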
Solution 3: like Solution 2 but now using .(.SD) in place of dt to refer to the original dt without naming it directly.
dt[, row := .I]    # add row numbers (as in all the solutions)
setkey(dt, id, row, date_down)    # set key for dt

# For all rows of dt, create found_date by reference :=
dt[, found_date :=
  # dt[code=='p'][dt,
  dt[code=='p'][.(.SD),    # our subset (or another data.table), joined to .SD (referring to original dt)
    .( x.date_up ),
    on = .(id==id, row > row, date_up > date_down),
    mult = "first"] ]
.SD above references back to the original dt that we are assigning back into. It corresponds to the subset of the data.table that contains the rows selected in the first dt[, which is all the rows because we didn't filter it.
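The behaviour of .SD described above can be checked on a toy table (toy and its columns g and v are hypothetical names used only for this sketch). Without a filter or by, .SD covers every row and column; with a filter in i it covers only the selected rows; with by it is the per-group subset without the grouping column.

```r
library(data.table)

toy <- data.table(g = c("a", "a", "b"), v = 1:3)

n_all      <- toy[, nrow(.SD)]                 # no filter: all rows
n_filtered <- toy[v > 1L, nrow(.SD)]           # rows selected in i only
cols_all   <- toy[, names(.SD)]                # all columns when there is no by
per_group  <- toy[, .(n = nrow(.SD)), by = g]  # one .SD per group

print(n_all)
print(n_filtered)
print(cols_all)
print(per_group)
```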
Note: In Solution 3 I used setkey() to set the key. I should have done that in Solution 1 and Solution 2 as well; however, I didn't want to change those solutions after @OllieB had tested them successfully.
Solution 4: like Solution 3 but using .SD once more than previously. Our main data.table name dt now appears only once across our entire expression!
# add row column and setkey() as previous solutions
dt[, found_date :=
  # dt[code=='p'][.(.SD),
  .SD[code=='p'][.SD,    # .SD in place of dt at left! Also, removed .() at right (not sure on this second change)
    .(found_date = x.date_up),
    on = .(id==id, row > row, date_up > date_down),
    mult = "first"]]
With the change above, our data.table name dt appears only once. I like that a lot because it makes it easy to copy, adapt and reuse elsewhere.
Also note: where I'd previously used .(.SD), I've now removed the .() around .SD because it doesn't appear to be required. However, I'm not sure whether that change has any performance benefit or whether it's the preferred data.table syntax. I would be grateful if anyone can add a comment to advise on that point.
Solution 5: like the previous solutions but making use of by to explicitly group the subsets when joining.
# add row column and setkey() as previous solutions
dt[, found_date :=
  .SD[code=='p'][.SD,
    .(found_date = x.date_up),
    # on = .(id==id, row > row, date_up > date_down),
    on = .(row > row, date_up > date_down),    # removed the id column from here
    mult = "first"]
  , by = id]    # added by = id to group the .SD subsets
In this last solution I changed it to use the by clause to explicitly group the .SD subsets on id.
Note: Solution 5 did not perform well against OllieB's actual data compared to Solutions 1 - 4. However, testing on my own mock data I found that Solution 5 could perform well when the number of unique groups in the id column was low:
- With only 6 groups in 1.5M records this solution worked just as fast as the others.
- With 40k groups in 1.5M records I saw similar poor performance as OllieB reported.
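A comparison like this could be set up along the following lines. This is only a sketch: make_mock() and per_group() are hypothetical helpers, the table sizes are kept small so it runs quickly, and per_group() is a simple stand-in for a by-group operation that subsets .SD rather than Solution 5 itself. Timings will vary by machine, so none are quoted here.

```r
library(data.table)

# Hypothetical helper: n rows spread evenly over n_groups ids
make_mock <- function(n, n_groups) {
  data.table(
    id   = rep_len(seq_len(n_groups), n),
    code = rep_len(c("p", "f", "p"), n),
    v    = seq_len(n)
  )
}

# Stand-in for a per-group operation that subsets .SD inside by=
per_group <- function(d) d[, .(n_p = nrow(.SD[code == "p"])), by = id]

few  <- make_mock(30000L, 6L)      # few large groups
many <- make_mock(30000L, 3000L)   # many small groups

t_few  <- system.time(r_few  <- per_group(few))[["elapsed"]]
t_many <- system.time(r_many <- per_group(many))[["elapsed"]]

cat("groups:", uniqueN(few$id),  "elapsed:", t_few,  "\n")
cat("groups:", uniqueN(many$id), "elapsed:", t_many, "\n")
```

Swapping per_group() for the solution under test, and raising the row count towards the real data size, gives a like-for-like check of how the group count affects elapsed time.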
For the 1.45M records in OllieB's actual data, Solutions 1 to 4 each took 2.42 seconds or less "elapsed" time according to OllieB's feedback. Solution 3 appears to have worked fastest for OllieB, with "elapsed=1.22" seconds.
I personally prefer Solution 4 because of the simpler syntax.
Solution 5 (using the by clause) performed poorly, taking 577 seconds in OllieB's testing on his real data.
data.table version: 1.12.0
R version 3.5.3 (2019-03-11)
on invoking [often] more efficient secondary indices and auto indexing.
As part of your question you've asked for "any good references to data.table". I've found the following helpful:
data.table Getting started Wiki on GitHub is the place to start.
In particular for this problem it's worth reading:
Importantly, note that this answer by @Arun, which explains "the reason for implementing on= argument", suggests it may no longer be necessary to set keys:
It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.
In most cases therefore, there shouldn't be a need to set keys any more. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.
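To make the quoted advice concrete, here is a minimal sketch with made-up tables a and b showing that an ad hoc join with on= needs no key and gives the same matches as a keyed join. The names a, b, ak, adhoc and keyed are illustrative only.

```r
library(data.table)

a <- data.table(id = c(3L, 1L, 2L), x = c("c", "a", "b"))
b <- data.table(id = c(2L, 3L), y = c(20, 30))

# Ad hoc join: no key set, a keeps its original row order
adhoc <- a[b, on = .(id)]

# Keyed join: setkey() physically reorders the copy by id first
ak <- copy(a)
setkey(ak, id)
keyed <- ak[b]    # joins the key of ak to the first column of b

print(adhoc)
print(keyed)
```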
This SO question seems to be the hub of information about the different data.table joins: How to join (merge) data frames (inner, outer, left, right)?
Finally, data.table cheat sheet is an excellent reference (from link found on the data.table Getting started Wiki on GitHub).
As always I'm grateful if anyone has suggestions as perhaps this can be improved further.
Please feel free to comment, correct or post other solutions if you can add anything.
Here is a quick and dirty way which doesn't require much thinking on your part; it captures the first viable option in the subset and leaves an NA if none exists.
The do(f(.)) call evaluates the predefined function f on each subset of dt defined by the group_by statement. I would translate that simple script into Rcpp for serious use.
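library(dplyr)

f <- function(x){
  # start with an all-NA Date column to hold the result
  x <- x %>% mutate(founddate = as.Date(NA))
  for(i in 1:nrow(x)){
    y <- x[i, "date_down"]
    # among the rows after row i, take the first "p" row whose date_up is later than y
    x[i, "founddate"] <- (x[-c(1:i),] %>% filter(code == "p", date_up > y) %>% select(date_up))[1, ]
  }
  return(x)
}

dt %>% group_by(id) %>% do(f(.))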
# A tibble: 12 x 5
# Groups: id [6]
id code date_down date_up founddate
<int> <chr> <date> <date> <date>
1 1 p 2019-01-01 2019-01-02 NA
2 1 f 2019-01-02 2019-01-03 NA
3 2 f 2019-01-02 2019-01-02 NA
4 2 p 2019-01-03 NA NA
5 3 p 2019-01-04 NA NA
6 4 <NA> 2019-01-05 2019-01-05 NA
7 5 f 2019-01-07 2019-01-08 2019-01-08
8 5 p 2019-01-07 2019-01-08 2019-01-09
9 5 p 2019-01-09 2019-01-09 NA
10 6 f 2019-01-10 2019-01-10 2019-01-11
11 6 p 2019-01-10 2019-01-10 2019-01-11
12 6 p 2019-01-10 2019-01-11 NA
Your comment about terrible performance is unsurprising. I would send this as a personal message if I knew how, but below is an Rcpp::cppFunction that does the same thing.
Rcpp::cppFunction('DataFrame fC(DataFrame x) {
  int i, j;
  int n = x.nrows();
  CharacterVector code = x["code"];
  DateVector date_up = x["date_up"];
  DateVector date_down = x["date_down"];
  DateVector founddate = rep(NA_REAL, n);
  for(i = 0; i < n; i++){
    for(j = i + 1; j < n; j++){
      if(code(j) == "p"){
        if(date_up(j) > date_down(i)){
          founddate(i) = date_up(j);
          break;
        } else{
          continue;
        }
      } else{
        continue;
      }
    }
  }
  x.push_back(founddate, "founddate");
  return x;
}')
dt %>% group_by(id) %>% do(fC(.))
A Not-the-Data-Table-Way approach:
> df <- structure(list(
+ id = c(1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L, 5L, 6L, 6L, 6L),
+ code = c("p", "f", "f", "p", "p", "<NA>", "f", "p", "p", "f", "p", "p"),
+ date_down = structure(c(17897, 17898, 17898, 17899, 17900, 17901, 17903, 17903, 17905, 17906, 17906, 17906), class = "Date"),
+ date_up = structure(c(17898, 17899, 17898, NA, NA, 17901, 17904, 17904, 17905, 17906, 17906, 17907), class = "Date")),
+ class = c("data.frame"),
+ row.names = c(NA, -12L))
>
>
> Lista <- lapply(split(df, df$id), function(x){
+ x$founddate <-
+ sapply(c(1:nrow(x)), function(y){
+ na.omit(sapply(y:nrow(x), function(i){
+ ifelse(x[i + 1, "code"] == "p" & x[i + 1, "date_up"] > x[y, "date_down"],
+ x[i + 1, "date_up"], NA)
+ }))[1]
+ })
+ x$founddate <- as.Date(x$founddate, origin = "1970-01-01")
+ return(x)
+ })
>
>
> df <- do.call(rbind.data.frame, Lista)
>
> df
id code date_down date_up founddate
1.1 1 p 2019-01-01 2019-01-02 <NA>
1.2 1 f 2019-01-02 2019-01-03 <NA>
2.3 2 f 2019-01-02 2019-01-02 <NA>
2.4 2 p 2019-01-03 <NA> <NA>
3 3 p 2019-01-04 <NA> <NA>
4 4 <NA> 2019-01-05 2019-01-05 <NA>
5.7 5 f 2019-01-07 2019-01-08 2019-01-08
5.8 5 p 2019-01-07 2019-01-08 2019-01-09
5.9 5 p 2019-01-09 2019-01-09 <NA>
6.10 6 f 2019-01-10 2019-01-10 2019-01-11
6.11 6 p 2019-01-10 2019-01-10 2019-01-11
6.12 6 p 2019-01-10 2019-01-11 <NA>
>
Under the given conditions, there can be more than one match per row. The proposed answer returns the first match, but this can be modified.
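For checking small cases, the matching rule can also be written as a plain base R loop. This is only a sketch: first_later_up() is a hypothetical helper name, and the sample data is rebuilt here with a real NA in code rather than the string "<NA>". The vector hit holds every qualifying later row, so taking hit[1] gives the first match and keeping all of hit would give all matches.

```r
sample_df <- data.frame(
  id   = c(1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L, 5L, 6L, 6L, 6L),
  code = c("p", "f", "f", "p", "p", NA, "f", "p", "p", "f", "p", "p"),
  date_down = structure(c(17897, 17898, 17898, 17899, 17900, 17901,
                          17903, 17903, 17905, 17906, 17906, 17906), class = "Date"),
  date_up   = structure(c(17898, 17899, 17898, NA, NA, 17901,
                          17904, 17904, 17905, 17906, 17906, 17907), class = "Date"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, the date_up of the first later row
# with the same id, code "p" and date_up after this row's date_down
first_later_up <- function(df) {
  n <- nrow(df)
  out <- structure(rep(NA_real_, n), class = "Date")
  for (i in seq_len(n)) {
    hit <- which(seq_len(n) > i &
                 df$id == df$id[i] &
                 df$code %in% "p" &
                 !is.na(df$date_up) &
                 df$date_up > df$date_down[i])
    if (length(hit) > 0) out[i] <- df$date_up[hit[1]]
  }
  out
}

sample_df$founddate <- first_later_up(sample_df)
print(sample_df)
```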
Hope it helps.