I have been trying to implement the algorithm recently proposed in this paper. Given a large amount of text (corpus), the algorithm is supposed to return chara
The following runs in under 7 seconds on my machine, for all the bigrams:
library(dplyr)
# join the two n-gram tables on their shared first two words;
# clashing columns get suffixed .x (from token.df[[2]]) and .y (from token.df[[3]])
res <- inner_join(token.df[[2]], token.df[[3]], by = c('w1','w2'))
res <- group_by(res, w1, w2)
# keep a (w1, w2) pair only if every matching row satisfies mi.y < mi.x
bigrams <- filter(summarise(res, keep = all(mi.y < mi.x)), keep)
There's nothing special about dplyr here. An equally fast (or faster) solution could surely be done with data.table or directly in SQL. The key is to switch to joins (as in SQL) rather than iterating through everything yourself. In fact, I wouldn't be surprised if simply using merge in base R and then aggregate were orders of magnitude faster than what you're doing now. (But you really should be doing this with data.table, dplyr, or directly in a SQL database.)
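For what it's worth, a rough, untested sketch of that merge-then-aggregate idea in base R (assuming, as above, that token.df[[2]] and token.df[[3]] both have columns w1, w2 and mi) might look like:
m <- merge(token.df[[2]], token.df[[3]], by = c('w1','w2'))  # mi.x comes from [[2]], mi.y from [[3]]
m$keep <- m$mi.y < m$mi.x
agg <- aggregate(keep ~ w1 + w2, data = m, FUN = all)
base_bigrams <- agg[agg$keep, c('w1','w2')]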
Indeed, this data.table version:
library(data.table)
dt2 <- setkey(data.table(token.df[[2]]), w1, w2)
dt3 <- setkey(data.table(token.df[[3]]), w1, w2)
# join dt2 into dt3 on the (w1, w2) key; mi comes from dt3 and the clashing
# column from dt2 appears as mi.1, then keep only the groups where all mi < mi.1
dt_tmp <- dt3[dt2, allow.cartesian = TRUE][, list(k = all(mi < mi.1)), by = c('w1','w2')][(k)]
is faster still (roughly 2x). I'm not even sure that I've squeezed all the speed I could out of either package, to be honest.
(Edit from Rick. I attempted this as a comment, but the syntax was getting messed up.)
If using data.table, this should be even faster, as data.table has a by-without-by feature (see ?data.table for more info):
dt_tmp <- dt3[dt2,list(k = all(mi < i.mi)), allow.cartesian = TRUE][(k)]
Note that when joining data.tables you can prefix a column name with i. to indicate that the column should be taken from the data.table supplied in the i= argument.
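To make the i. prefix concrete, here is a tiny toy example (made-up data, not from the question), where both tables have a column named v:
library(data.table)
X <- data.table(id = c(1L, 2L), v = c(10, 30), key = 'id')
Y <- data.table(id = c(1L, 2L), v = c(15, 5), key = 'id')
X[Y, list(id, x_v = v, y_v = i.v)]  # v is X's column, i.v is the column from Y (the i= table)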