I'm trying to figure out how to run the dplyr::do
function in parallel. After reading some of the docs, it seems that dplyr::init_cluster() should be sufficient.
As mentioned by @Maciej, you could try multidplyr:
## Install from github
devtools::install_github("hadley/multidplyr")
Use partition()
to split your dataset across multiple cores:
library(dplyr)
library(multidplyr)
test <- data_frame(a=1:3, b=letters[c(1:2, 1)])
test1 <- partition(test, a)
This initializes a 3-core cluster (one for each value of a):
# Initialising 3 core cluster.
Then simply perform your do()
call:
test1 %>%
  do({
    dplyr::data_frame(c = rep(max(.$a)), times = max(.$a))
  })
Which gives:
#Source: party_df [3 x 3]
#Groups: a
#Shards: 3 [1--1 rows]
#
# a c times
# (int) (int) (int)
#1 1 1 1
#2 2 2 2
#3 3 3 3
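Note that the result above is still a party_df spread across the workers. To pull it back into a single local data frame, you could pipe the result through collect(), multidplyr's method for retrieving a partitioned data frame (sketch below assumes the same test1 object as above):

```r
library(dplyr)
library(multidplyr)

test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])
test1 <- partition(test, a)

# Run the per-group computation on the cluster, then gather
# the shards back into one local data frame with collect()
res <- test1 %>%
  do({
    dplyr::data_frame(c = rep(max(.$a)), times = max(.$a))
  }) %>%
  collect()

res
```

Because each shard computes independently, the row order of the collected result may differ from the original grouping order; add an arrange(a) after collect() if you need it sorted.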
According to https://twitter.com/cboettig/status/588068454239830017, this feature does not currently seem to be supported.
You could check out Hadley's new package, multidplyr.