Speeding up the processing of large data frames in R

joran

The following runs in under 7 seconds on my machine, for all the bigrams:

library(dplyr)

# join the two tables on the shared word columns; the duplicated mi column
# gets the default suffixes, so mi.x comes from token.df[[2]] and mi.y from token.df[[3]]
res <- inner_join(token.df[[2]], token.df[[3]], by = c('w1','w2'))
# keep a (w1, w2) pair only when every mi.y in its group is below mi.x
res <- group_by(res, w1, w2)
bigrams <- filter(summarise(res, keep = all(mi.y < mi.x)), keep)

There's nothing special about dplyr here. An equally fast (or faster) solution could surely be done using data.table or directly in SQL. You just need to switch to using joins (as in SQL) rather than iterating through everything yourself. In fact, I wouldn't be surprised if simply using merge and then aggregate in base R were orders of magnitude faster than what you're doing now. (But you really should be doing this with data.table, dplyr or directly in a SQL database.)
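For concreteness, here's a minimal sketch of that merge + aggregate route, assuming token.df is the list of data frames from the question, with columns w1, w2 and mi in each element (untested against your data):

# merge adds .x/.y suffixes to the duplicated mi column, mirroring inner_join above
m <- merge(token.df[[2]], token.df[[3]], by = c('w1', 'w2'))
# flag rows where the token.df[[3]] mi is below the token.df[[2]] mi
m$keep <- m$mi.y < m$mi.x
# require the flag to hold for every row in each (w1, w2) group
agg <- aggregate(keep ~ w1 + w2, data = m, FUN = all)
bigrams_base <- agg[agg$keep, c('w1', 'w2')]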

Indeed, this:

library(data.table)

# convert to data.table and key both tables by the join columns
dt2 <- setkey(data.table(token.df[[2]]), w1, w2)
dt3 <- setkey(data.table(token.df[[3]]), w1, w2)
# join dt2 into dt3; the duplicated mi column coming from dt2 is named i.mi
dt_tmp <- dt3[dt2, allow.cartesian = TRUE][, list(k = all(mi < i.mi)), by = c('w1','w2')][(k)]

is faster still (~2x). I'm not sure I've squeezed all the speed I could out of either package, to be honest.
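If you want to compare the two approaches on your own data, here's a rough sketch using base R's system.time, assuming the objects defined above are already in your workspace:

# time the dplyr pipeline
system.time({
  res <- inner_join(token.df[[2]], token.df[[3]], by = c('w1', 'w2'))
  bigrams <- filter(summarise(group_by(res, w1, w2), keep = all(mi.y < mi.x)), keep)
})

# time the data.table pipeline (dt2 and dt3 keyed as above)
system.time(
  dt3[dt2, allow.cartesian = TRUE][, list(k = all(mi < i.mi)), by = c('w1','w2')][(k)]
)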


(Edit from Rick. Attempted as a comment, but the syntax was getting messed up.)
If using data.table, this should be even faster, since data.table has a by-without-by feature (see ?data.table for more info):

# by-without-by: j is evaluated once per row of dt2, i.e. per (w1, w2) join key
dt_tmp <- dt3[dt2, list(k = all(mi < i.mi)), allow.cartesian = TRUE][(k)]

Note that when joining data.tables you can prefix a column name with i. to refer to that column from the data.table given in the i= argument.
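One caveat: newer data.table releases (1.9.4 and later, if I remember the change correctly) dropped the implicit by-without-by behaviour, so on a current installation the per-group version of that call needs an explicit by = .EACHI, along these lines:

# by = .EACHI evaluates j once per row of dt2, i.e. per (w1, w2) join key
dt_tmp <- dt3[dt2, list(k = all(mi < i.mi)), by = .EACHI, allow.cartesian = TRUE][(k)]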
