How to select rows from one data.table to apply in another data.table?

旧时难觅i 2021-01-24 00:13

I have two data.tables, df (21 million rows) and tmp (500k rows).

df has three columns linking an original patent (origpat)

2 Answers
  • 2021-01-24 00:40

    If my understanding of the problem is correct, all you need is to join these two tables, matching ref.pat in df to pnum in tmp. Make sure ref.pat in df and pnum in tmp have the same class. The desired join is then:

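    library(data.table)

    # Make sure both objects are data.tables
    df <- data.table(df)
    tmp <- data.table(tmp)

    # Inner join: keep only rows of df whose ref.pat matches a pnum in tmp.
    # Naming the columns in `on` avoids depending on keys or on the column order of tmp.
    out <- df[tmp, on = .(ref.pat = pnum), nomatch = 0]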
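
    To see what the join returns, here is a minimal sketch on made-up data. Only the column names (origpat, ref.pat, mainprim, pnum, prim) come from the question; the values are hypothetical. The sketch names the join columns explicitly with on =.

    ```r
    library(data.table)

    # Hypothetical toy data; only the column names come from the question.
    df <- data.table(origpat  = c("A", "A", "B"),
                     ref.pat  = c("p1", "p2", "p9"),
                     mainprim = c("x", "y", "x"))
    tmp <- data.table(pnum = c("p1", "p1", "p2"),
                      prim = c("x", "z", "y"))

    # Both join columns must share a class (character here).
    stopifnot(class(df$ref.pat) == class(tmp$pnum))

    # Inner join on ref.pat == pnum; rows of df without a match in tmp are dropped.
    out <- df[tmp, on = .(ref.pat = pnum), nomatch = 0]
    print(out)
    ```

    The row with ref.pat "p9" has no counterpart in tmp, so it does not appear in out.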
    
  • 2021-01-24 00:52

    The best idea I came up with is:

    df[,idx := .I] # Add an index to the data.table to group by row of df
    df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
         length(tmp[pnum == ref.pat,prim]),by = idx]
    

    Or reusing your overlap function (still using the idx column):

    df[,compare := overlap(
                    mainprim,
                    tmp[pnum == ref.pat, prim]),
        by=idx]
    

    This groups df by row, so inside each group ref.pat and mainprim hold the values of that single row. They are then used to pick the matching rows of tmp and to compare their prim values against the row's mainprim.

    If you want to avoid creating the idx column, you can use by = 1:nrow(df) instead, but this could slow down the process (grouping by an actual column is quicker in data.table).
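
    As a rough illustration of the row-wise approach, here is a small sketch on made-up data (the values are hypothetical, only the column names come from the question):

    ```r
    library(data.table)

    # Hypothetical toy data; only the column names come from the question.
    df <- data.table(ref.pat  = c("p1", "p2", "p1"),
                     mainprim = c("x", "x", "z"))
    tmp <- data.table(pnum = c("p1", "p1", "p2", "p2"),
                      prim = c("x", "z", "y", "y"))

    df[, idx := .I]  # one group per row of df

    # Within each group ref.pat and mainprim are length-1 vectors, so
    # tmp[pnum == ref.pat, prim] returns the prim values for that row's ref.pat.
    df[, compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
           length(tmp[pnum == ref.pat, prim]), by = idx]
    print(df)
    ```

    For the first row, tmp holds prim values "x" and "z" for "p1", and one of the two equals mainprim "x", so compare is 0.5.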


    Great improvements by @Docendo:

    You can further speed up the process by creating an intermediate variable to store the subset instead of doing the subset twice per row:

    df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]
    

    And in case there are duplicated combinations of ref.pat and mainprim in df you could further optimize the performance by using by = list(ref.pat, mainprim) instead of by = idx:

    df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
       by = list(ref.pat, mainprim)]
    

    Another, probably minimal, improvement is to use mean() instead of sum()/length():

    df[,compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]
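
    To check that the variants agree, here is a small sketch on made-up data with a repeated (ref.pat, mainprim) pair. The column names by_row and by_pair are hypothetical and only serve the comparison.

    ```r
    library(data.table)

    # Hypothetical toy data; the pair ("p1", "x") appears twice.
    df <- data.table(ref.pat  = c("p1", "p2", "p1", "p1"),
                     mainprim = c("x", "x", "z", "x"))
    tmp <- data.table(pnum = c("p1", "p1", "p2", "p2"),
                      prim = c("x", "z", "y", "y"))

    df[, idx := .I]

    # Row-wise version, with the subset of tmp stored once per row.
    df[, by_row := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
       by = idx]

    # One lookup per distinct (ref.pat, mainprim) pair, using mean().
    df[, by_pair := mean(tmp[pnum == ref.pat, prim] == mainprim),
       by = list(ref.pat, mainprim)]

    # Both columns should hold the same shares.
    print(all.equal(df$by_row, df$by_pair))
    ```

    Note that a ref.pat with no match in tmp gives NaN in both versions, since sum()/length() becomes 0/0 and mean() of an empty vector is NaN.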
    