Question
When using broom, I used to combine dplyr::group_by() and dplyr::do() to perform actions on grouped data, thanks to @drob.
For example, fitting a linear model to cars depending on their gear system:
library("dplyr")
library("tidyr")
library("broom")
# using do()
mtcars %>%
  group_by(am) %>%
  do(tidy(lm(mpg ~ wt, data = .)))
# Source: local data frame [4 x 6]
# Groups: am [2]
# am term estimate std.error statistic p.value
# (dbl) (chr) (dbl) (dbl) (dbl) (dbl)
# 1 0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
# 2 0 wt -3.785908 0.7665567 -4.938848 1.245595e-04
# 3 1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
# 4 1 wt -9.084268 1.2565727 -7.229401 1.687904e-05
After reading the recent post from @hadley about tidyr v0.4.1, I discovered that the same thing can be achieved using nest() and purrr::map().
Same example as before:
by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  mutate(model = purrr::map(data, ~ lm(mpg ~ wt, data = .)))
by_am %>%
  unnest(model %>% purrr::map(tidy))
# Source: local data frame [4 x 6]
# am term estimate std.error statistic p.value
# (dbl) (chr) (dbl) (dbl) (dbl) (dbl)
# 1 1 (Intercept) 46.294478 3.1198212 14.838824 1.276849e-08
# 2 1 wt -9.084268 1.2565727 -7.229401 1.687904e-05
# 3 0 (Intercept) 31.416055 2.9467213 10.661360 6.007748e-09
# 4 0 wt -3.785908 0.7665567 -4.938848 1.245595e-04
The ordering changed, but results are the same.
Given that both largely address the same use case, I am wondering whether both approaches are going to be supported going forward. Will one method become the canonical tidyverse way?
If neither is considered canonical, what use case(s) require that both approaches continue to be supported?
From my short experience:
- do
  - progress bar, nice when many models are computed.
  - @Axeman's comment: can be parallelized using multidplyr.
  - smaller object, but the models need to be re-run if we want broom::glance, for example.
- map
  - data, subsets and models are kept within one tbl_df.
  - easy to extract another component of the models, even if unnest() takes a bit of time.
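The re-run versus reuse trade-off in the list above can be sketched as follows. This is only an illustration, assuming current releases of dplyr, tidyr, purrr and broom (the unnest() call style differs from the tidyr v0.4.1 syntax used earlier); `do_glance`, `map_glance` and the `glanced` column are hypothetical names that do not appear in the examples above.

```r
library("dplyr")
library("tidyr")
library("purrr")
library("broom")

# do(): only the glance() output is kept, so the models are fitted again here
do_glance <- mtcars %>%
  group_by(am) %>%
  do(glance(lm(mpg ~ wt, data = .)))

# nest() + map(): fit once and keep the models in a list-column
by_am <- mtcars %>%
  group_by(am) %>%
  nest() %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))

# Reuse the stored models for glance() without refitting
# (`glanced` is a hypothetical column name)
map_glance <- by_am %>%
  mutate(glanced = map(model, glance)) %>%
  select(am, glanced) %>%
  unnest(glanced) %>%
  ungroup()
```

Both objects should hold one row of model-level statistics per value of am; the nested version simply gets there from models that were already stored in by_am.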
If you have any insights or remarks, I would be happy to get some feedback.
Source: https://stackoverflow.com/questions/35505187/comparison-between-dplyrdo-purrrmap-what-advantages