Speeding up the processing of large data frames in R

情深已故 2021-02-08 10:59

Context

I have been trying to implement the algorithm recently proposed in this paper. Given a large amount of text (corpus), the algorithm is supposed to return chara

1 Answer
  • 2021-02-08 11:30

    The following runs in under 7 seconds on my machine, for all the bigrams:

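    library(dplyr)
    # Join on the shared (w1, w2) columns; mi.x comes from token.df[[2]], mi.y from token.df[[3]]
    res <- inner_join(token.df[[2]], token.df[[3]], by = c('w1', 'w2'))
    res <- group_by(res, w1, w2)
    # Keep a (w1, w2) pair only if mi.y < mi.x holds for every joined row
    bigrams <- filter(summarise(res, keep = all(mi.y < mi.x)), keep)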
    

    There's nothing special about dplyr here. An equally fast (or faster) solution could surely be written with data.table or directly in SQL. The key is to switch to joins (as in SQL) rather than iterating through everything yourself. In fact, I wouldn't be surprised if simply using merge and then aggregate in base R were orders of magnitude faster than what you're doing now. (But you really should be doing this with data.table, dplyr, or directly in a SQL database.)
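    The base R route mentioned above can be sketched as follows. This is only an illustration on toy data: the two data frames and their column layout (w1, w2, an optional w3, and mi) are assumptions standing in for token.df[[2]] and token.df[[3]], which are not shown in the question.

```r
# Toy stand-ins for token.df[[2]] and token.df[[3]] (hypothetical data;
# the column layout w1, w2, w3, mi is an assumption).
bigram_df <- data.frame(
  w1 = c("a", "b", "c"),
  w2 = c("b", "c", "d"),
  mi = c(5, 2, 4),
  stringsAsFactors = FALSE
)
trigram_df <- data.frame(
  w1 = c("a", "a", "b", "c"),
  w2 = c("b", "b", "c", "d"),
  w3 = c("x", "y", "z", "e"),
  mi = c(3, 4, 6, 1),
  stringsAsFactors = FALSE
)

# Inner join on the shared columns; the clashing mi columns get the
# default suffixes: mi.x (from bigram_df) and mi.y (from trigram_df).
joined <- merge(bigram_df, trigram_df, by = c("w1", "w2"))

# Row-wise comparison, then one all() per (w1, w2) group.
joined$keep <- joined$mi.y < joined$mi.x
flags <- aggregate(keep ~ w1 + w2, data = joined, FUN = all)

# Retain only the pairs for which every joined row passed the test.
kept_bigrams <- flags[flags$keep, c("w1", "w2")]
print(kept_bigrams)
```

    The structure mirrors the dplyr version: one join, one grouped aggregation, one filter, with no explicit loop over rows.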

    Indeed, this:

    library(data.table)
    dt2 <- setkey(data.table(token.df[[2]]),w1,w2)
    dt3 <- setkey(data.table(token.df[[3]]),w1,w2)
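    # Note: recent data.table versions name the clashing column from dt2 `i.mi` rather than `mi.1`;
    # if `mi.1` is not found, use all(mi < i.mi) instead.
    dt_tmp <- dt3[dt2,allow.cartesian = TRUE][,list(k = all(mi < mi.1)),by = c('w1','w2')][(k)]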
    

    is faster still (~2x). To be honest, I'm not sure I've squeezed all the available speed out of either package.


    (edit from Rick. Attempted as comment, but syntax was getting messed up)
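    If using data.table, this should be even faster, because data.table can group by each row of the i table during the join itself (the "by-without-by" feature; see ?data.table for more info). In current data.table versions this grouping is no longer implicit and has to be requested with by = .EACHI:

    dt_tmp <- dt3[dt2, list(k = all(mi < i.mi)), by = .EACHI, allow.cartesian = TRUE][(k)]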
    

    Note that when joining data.tables, you can prefix a column name with i. to refer specifically to the column from the data.table passed in the i argument.
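    To make the i. prefix concrete, here is a small sketch on toy data. The tables x_dt and i_dt and their contents are hypothetical, and the example assumes a current data.table release, where grouping per row of i is requested with by = .EACHI.

```r
library(data.table)

# Hypothetical toy tables: x_dt plays the role of dt3, i_dt the role of dt2.
x_dt <- data.table(
  w1 = c("a", "a", "b"),
  w2 = c("b", "b", "c"),
  mi = c(3, 4, 6),
  key = c("w1", "w2")
)
i_dt <- data.table(
  w1 = c("a", "b"),
  w2 = c("b", "c"),
  mi = c(5, 2),
  key = c("w1", "w2")
)

# Inside j, `mi` refers to x_dt's column and `i.mi` to i_dt's column.
# by = .EACHI evaluates j once for each row of i_dt.
res <- x_dt[i_dt, list(k = all(mi < i.mi)), by = .EACHI]

# Keep only the (w1, w2) pairs for which the condition held.
kept <- res[(k)]
print(kept)
```

    Because both tables have a column called mi, the unprefixed name resolves to the x table and the prefixed one to the i table, which is what lets the comparison be written in a single join expression.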
