I have two data.tables: df (21 million rows) and tmp (500k rows). df has three columns linking an original patent (origpat).
If my understanding of the problem is correct, all you need is to join the two tables on ref.pat. Make sure the classes of ref.pat in df and pnum in tmp are the same. Then the desired join can be obtained by:
library(data.table)
df <- as.data.table(df)
tmp <- as.data.table(tmp)
setkey(df, ref.pat)   # key both tables so the join columns line up
setkey(tmp, pnum)
out <- df[tmp, nomatch = 0]  # inner join: rows without a match are dropped
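As a minimal sketch with made-up toy data (the column names follow the question, the values are invented), the join keeps only patents present in both tables:

```r
library(data.table)

# Toy stand-ins for the real tables (invented values)
df  <- data.table(origpat  = c("A", "B", "C"),
                  ref.pat  = c("P1", "P2", "P9"),
                  mainprim = c("X", "Y", "Z"))
tmp <- data.table(pnum = c("P1", "P2", "P3"),
                  prim = c("X", "Q", "X"))

setkey(df, ref.pat)
setkey(tmp, pnum)

# Inner join: P9 (only in df) and P3 (only in tmp) drop out
out <- df[tmp, nomatch = 0]
```

The result has one row per matching pair and carries df's columns plus tmp's non-key column prim.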
The best idea I came up with is:
df[, idx := .I]  # add a row index so we can group by row of df
df[, compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
                length(tmp[pnum == ref.pat, prim]), by = idx]
Or, reusing your overlap function (still using the idx column):
df[, compare := overlap(mainprim, tmp[pnum == ref.pat, prim]), by = idx]
What this does is group df by row and then use the columns within each group to get that row's mainprim and the matching subset of tmp.
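On toy data (invented values; tmp holds several prim entries per pnum), the row-wise computation can be sketched as:

```r
library(data.table)

df  <- data.table(ref.pat  = c("P1", "P2"),
                  mainprim = c("X", "Y"))
tmp <- data.table(pnum = c("P1", "P1", "P1", "P2", "P2"),
                  prim = c("X",  "X",  "Z",  "Y",  "Y"))

df[, idx := .I]
df[, compare := {
  x <- tmp[pnum == ref.pat, prim]   # prim values of the referenced patent
  sum(x == mainprim) / length(x)    # share matching this row's mainprim
}, by = idx]
```

For P1, two of the three prim values equal "X", so compare is 2/3; for P2 both values equal "Y", so compare is 1.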
If you want to avoid creating the idx column, you can use by = 1:nrow(df) instead, but this can slow things down (grouping by an actual column is faster in data.table).
Great improvements by @Docendo:
You can further speed up the process by storing the subset in an intermediate variable instead of computing it twice per row:
df[, compare := {x <- tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]
And in case there are duplicated combinations of ref.pat and mainprim in df, you can further improve performance by using by = list(ref.pat, mainprim) instead of by = idx, so each unique combination is computed only once:
df[, compare := {x <- tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = list(ref.pat, mainprim)]
Another, probably marginal, improvement is to use mean() instead of sum()/length():
df[, compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]
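The equivalence holds because mean() on a logical vector counts each TRUE as 1, so it is exactly the proportion sum()/length(). A tiny sketch with invented values:

```r
x <- c("X", "X", "Z")   # hypothetical prim values for one group
mainprim <- "X"

ratio_long  <- sum(x == mainprim) / length(x)  # explicit proportion
ratio_short <- mean(x == mainprim)             # same value, both equal 2/3
```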