Comparison between dplyr::do / purrr::map, what advantages? [closed]

Submitted by 人走茶凉 on 2019-12-17 21:50:25

Question


When using broom, I used to combine dplyr::group_by and dplyr::do to perform actions on grouped data, thanks to @drob. For example, fitting a linear model to the cars in mtcars separately for each transmission type (am):

library("dplyr")
library("tidyr")
library("broom")

# using do()
mtcars %>%
  group_by(am) %>%
  do(tidy(lm(mpg ~ wt, data = .)))

# Source: local data frame [4 x 6]
# Groups: am [2]

#     am        term  estimate std.error statistic      p.value
#   (dbl)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
# 1     0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
# 2     0          wt -3.785908 0.7665567 -4.938848 1.245595e-04
# 3     1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
# 4     1          wt -9.084268 1.2565727 -7.229401 1.687904e-05

After reading the recent post from @hadley about tidyr v0.4.1, I discovered that the same thing can be achieved using nest() and purrr::map().

Same example as before:

by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  mutate(model = purrr::map(data, ~ lm(mpg ~ wt, data = .)))

by_am %>%
  unnest(model %>% purrr::map(tidy))

# Source: local data frame [4 x 6]

#      am        term  estimate std.error statistic      p.value
#   (dbl)       (chr)     (dbl)     (dbl)     (dbl)        (dbl)
# 1     1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
# 2     1          wt -9.084268 1.2565727 -7.229401 1.687904e-05
# 3     0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
# 4     0          wt -3.785908 0.7665567 -4.938848 1.245595e-04

The row order changed, but the results are the same.
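The nest() / map() snippet above was written against tidyr v0.4.1 and may not run as written on newer tidyr releases, where unnest() expects a column rather than an expression. A minimal sketch of the same computation, assuming tidyr 1.0 or later; the `coefs` column name and the ungroup() step are choices made for this illustration, not part of the original code:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Fit one model per transmission type and keep it in a list-column.
by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  ungroup() %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# Tidy each model into its own list-column (hypothetical name: coefs),
# then unnest that column by name.
coefs <- by_am %>%
  mutate(coefs = map(model, tidy)) %>%
  select(am, coefs) %>%
  unnest(coefs)

coefs
```

The result should have one row per coefficient per group, matching the do() output apart from row order.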

Given that both largely address the same use case, I am wondering whether both approaches are going to be supported going forward. Will one method become the canonical tidyverse way? If one is not considered canonical, what use case(s) require that both approaches continue to be supported?

From my short experience:

  • do
    • shows a progress bar, which is nice when many models are computed.
    • @Axeman's comment: can be parallelized using multidplyr.
    • produces a smaller object, but has to be re-run if we also want, for example, broom::glance.
  • map
    • data, subsets and models are kept within one tbl_df.
    • easy to extract another component of the models, even if unnest() takes a bit of time.
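To illustrate the last point, here is a minimal sketch of pulling a second summary out of the stored models without refitting them. It assumes tidyr 1.0 or later; the `fit` column and the `fit_stats` object are hypothetical names used only for this example:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Models are fitted once and stored alongside their data.
by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  ungroup() %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# Model-level statistics come from the same model column;
# no lm() call is repeated here.
fit_stats <- by_am %>%
  mutate(fit = map(model, glance)) %>%
  select(am, fit) %>%
  unnest(fit)

fit_stats
```

With do(tidy(...)), only the tidied coefficients are kept, so getting the glance() summary would mean running the grouped fit a second time.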

If you have any insights or remarks, I would be happy to hear them.

Source: https://stackoverflow.com/questions/35505187/comparison-between-dplyrdo-purrrmap-what-advantages
