I have two data.tables: df (21 million rows) and tmp (500k rows). df has three columns linking an original patent (origpat).
If my understanding of the problem is correct, all you need is to join the two tables on ref.pat. Make sure the classes of ref.pat in df and pnum in tmp are the same. Then the desired join can be obtained by:
library(data.table)
df <- as.data.table(df)
tmp <- as.data.table(tmp)
setkey(df, ref.pat)   # key both tables so the join columns line up
setkey(tmp, pnum)
out <- df[tmp, nomatch = 0]  # inner join: rows without a match are dropped
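As a minimal sketch with made-up toy data (the column names follow the question, the values are invented), the join keeps only patents present in both tables:

```r
library(data.table)

# Toy stand-ins for the real tables (invented values)
df  <- data.table(origpat  = c("A", "B", "C"),
                  ref.pat  = c("P1", "P2", "P9"),
                  mainprim = c("X", "Y", "Z"))
tmp <- data.table(pnum = c("P1", "P2", "P3"),
                  prim = c("X", "Q", "X"))

setkey(df, ref.pat)
setkey(tmp, pnum)

# Inner join: P9 (only in df) and P3 (only in tmp) drop out
out <- df[tmp, nomatch = 0]
```

The result has one row per matching pair and carries df's columns plus tmp's non-key column prim.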
The best idea I came up with is:
df[, idx := .I]  # add a row index so we can group by row of df
df[, compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
                length(tmp[pnum == ref.pat, prim]), by = idx]
Or, reusing your overlap function (still using the idx column):
df[, compare := overlap(mainprim, tmp[pnum == ref.pat, prim]), by = idx]
What this does is group df by row and then use the columns within each group to get that row's mainprim and the matching subset of tmp.
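On toy data (invented values; tmp holds several prim entries per pnum), the row-wise computation can be sketched as:

```r
library(data.table)

df  <- data.table(ref.pat  = c("P1", "P2"),
                  mainprim = c("X", "Y"))
tmp <- data.table(pnum = c("P1", "P1", "P1", "P2", "P2"),
                  prim = c("X",  "X",  "Z",  "Y",  "Y"))

df[, idx := .I]
df[, compare := {
  x <- tmp[pnum == ref.pat, prim]   # prim values of the referenced patent
  sum(x == mainprim) / length(x)    # share matching this row's mainprim
}, by = idx]
```

For P1, two of the three prim values equal "X", so compare is 2/3; for P2 both values equal "Y", so compare is 1.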
If you want to avoid creating the idx column, you can use by = 1:nrow(df) instead, but this can slow things down (grouping by an actual column is faster in data.table).
Great improvements by @Docendo:
You can further speed up the process by storing the subset in an intermediate variable instead of computing it twice per row:
df[, compare := {x <- tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]
And in case there are duplicated combinations of ref.pat and mainprim in df, you can further improve performance by using by = list(ref.pat, mainprim) instead of by = idx, so each unique combination is computed only once:
df[, compare := {x <- tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = list(ref.pat, mainprim)]
Another, probably marginal, improvement is to use mean() instead of sum()/length():
df[, compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]
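The equivalence holds because mean() on a logical vector counts each TRUE as 1, so it is exactly the proportion sum()/length(). A tiny sketch with invented values:

```r
x <- c("X", "X", "Z")   # hypothetical prim values for one group
mainprim <- "X"

ratio_long  <- sum(x == mainprim) / length(x)  # explicit proportion
ratio_short <- mean(x == mainprim)             # same value, both equal 2/3
```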