How to parallelize do() calls with dplyr

清酒与你 2021-01-04 00:12

I'm trying to figure out how to run dplyr::do calls in parallel. After reading some of the docs, it seems that dplyr::init_cluster() should be sufficient.

3 Answers
  • 2021-01-04 00:15

    As mentioned by @Maciej, you could try multidplyr:

    ## Install from github
    devtools::install_github("hadley/multidplyr")
    

    Use partition() to split your dataset across multiple cores:

    library(dplyr)
    library(multidplyr)
    test <- data_frame(a=1:3, b=letters[c(1:2, 1)])
    test1 <- partition(test, a)
    

    This initializes a 3-core cluster (one for each value of a):

    # Initialising 3 core cluster.
    

    Then simply perform your do() call:

    test1 %>%
      do({
        dplyr::data_frame(c = rep(max(.$a)), times = max(.$a))
      })
    

    Which gives:

    #Source: party_df [3 x 3]
    #Groups: a
    #Shards: 3 [1-1 rows]
    #
    #      a     c times
    #  (int) (int) (int)
    #1     1     1     1
    #2     2     2     2
    #3     3     3     3
    
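    After the parallel do() call, the results still live on the worker processes. A collect() call (part of multidplyr's API for party_df objects) brings them back into a regular local tibble. A minimal sketch, continuing from the test1 object above:

    ## Run the do() call on the shards, then gather the
    ## per-shard results back into a local data frame
    result <- test1 %>%
      do({
        dplyr::data_frame(c = rep(max(.$a)), times = max(.$a))
      }) %>%
      collect()

    Note that in recent dplyr versions data_frame() is deprecated in favour of tibble().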
  • 2021-01-04 00:16

    According to https://twitter.com/cboettig/status/588068454239830017, this feature does not currently seem to be supported.

  • 2021-01-04 00:31

    You could check Hadley's new package multidplyr.
