When to use the map() function and when to use summarise_at()/mutate_at()


Can anyone suggest when to use map() (and the other map_*() functions) and when to use summarise_at()/mutate_at()?

2 Answers

名媛妹妹

2021-01-02 17:26

The biggest difference between {dplyr} and {purrr} is that {dplyr} is designed to work on data.frames only, while {purrr} is designed to work on any kind of list. Since a data.frame is a list of columns, you can also use {purrr} to iterate over a data.frame.

library(purrr)
map_chr(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
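As a quick check of the claim that a data.frame is a list, here is a small sketch that is not part of the original answer. It uses base R's is.list() and purrr's typed map variants on the built-in iris data; the variable names (n_unique, is_num, col_means) are only illustrative.

```r
library(purrr)

# A data.frame is a list whose elements are its columns
is.list(iris)
length(iris)   # number of columns, not rows

# Typed variants return an atomic vector instead of a list
n_unique <- map_int(iris, ~ length(unique(.x)))
is_num   <- map_lgl(iris, is.numeric)

# Keep only the numeric columns, then take their means
col_means <- map_dbl(iris[is_num], mean)
col_means
```

Each call visits one column at a time, which is why the results are named after the columns.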

summarise_at() and map_at() do not behave exactly the same: summarise_at() returns only the summary you asked for, whereas map_at() returns the whole data.frame as a list, with the modification applied only to the elements you named:

> library(purrr)
> library(dplyr)
> small_iris <- sample_n(iris, 5)
> map_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
$Sepal.Length
[1] 6.58

$Sepal.Width
[1] 3.2

$Petal.Length
[1] 6.7 1.3 5.7 4.3 4.7

$Petal.Width
[1] 2.0 0.4 2.1 1.3 1.5

$Species
[1] virginica  setosa     virginica  versicolor versicolor
Levels: setosa versicolor virginica

> summarise_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
  Sepal.Length Sepal.Width
1         6.58         3.2

map_at() always returns a list; mutate_at() always returns a data.frame:

> map_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
$Sepal.Length
[1] 0.77 0.54 0.67 0.64 0.67

$Sepal.Width
[1] 0.28 0.39 0.33 0.29 0.31

$Petal.Length
[1] 6.7 1.3 5.7 4.3 4.7

$Petal.Width
[1] 2.0 0.4 2.1 1.3 1.5

$Species
[1] virginica  setosa     virginica  versicolor versicolor
Levels: setosa versicolor virginica

> mutate_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1         0.77        0.28          6.7         2.0  virginica
2         0.54        0.39          1.3         0.4     setosa
3         0.67        0.33          5.7         2.1  virginica
4         0.64        0.29          4.3         1.3 versicolor
5         0.67        0.31          4.7         1.5 versicolor
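If you start from map_at() and still need a data.frame, the sketch below shows two ways to get one. It is an addition to the answer and assumes a reasonably recent {purrr}; head(iris, 5) replaces sample_n() so the result is reproducible, and first_rows, as_list, back_to_df and stays_df are illustrative names.

```r
library(purrr)

first_rows <- head(iris, 5)   # deterministic stand-in for sample_n(iris, 5)
cols <- c("Sepal.Length", "Sepal.Width")

# map_at() returns a plain list, even when given a data.frame
as_list <- map_at(first_rows, cols, ~ .x / 10)
is.data.frame(as_list)

# Every element still has the same length, so the list converts back cleanly
back_to_df <- as.data.frame(as_list)

# modify_at() keeps the type of its input, so no conversion is needed
stays_df <- modify_at(first_rows, cols, ~ .x / 10)
```

The conversion only works here because dividing by 10 keeps each column at its original length; with a summary such as mean, the elements would have different lengths.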

So, to sum up your first question: if you want to do a column-wise operation on a non-nested data.frame and get a data.frame back, go for {dplyr}.
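A note that goes beyond the original answer: in dplyr 1.0.0 and later, the scoped verbs summarise_at() and mutate_at() are superseded by across(). The sketch below shows what should be the equivalent calls, again using head(iris, 5) for reproducibility; first_rows, sepal_means and sepal_scaled are illustrative names.

```r
library(dplyr)

first_rows <- head(iris, 5)

# Counterpart of summarise_at(first_rows, c("Sepal.Length", "Sepal.Width"), mean)
sepal_means <- first_rows %>%
  summarise(across(c(Sepal.Length, Sepal.Width), mean))

# Counterpart of mutate_at(first_rows, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
sepal_scaled <- first_rows %>%
  mutate(across(c(Sepal.Length, Sepal.Width), ~ .x / 10))
```

Both results are still data.frames, so the conclusion above is unchanged.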

Regarding nested columns, you have to combine group_by(), nest() from {tidyr}, mutate() and map(). What you're doing here is creating a smaller version of your data.frame that contains a column which is a list of data.frames. Then you use map() to iterate over the elements of this new column.

Here is an example with our beloved iris:

library(tidyr)

iris_n <- iris %>% 
  group_by(Species) %>% 
  nest()
iris_n
# A tibble: 3 x 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>

Here, the new object is a data.frame whose column data is a list of smaller data.frames, one per Species (the factor we specified in group_by()). We can then iterate over this column by simply doing:

map(iris_n$data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x))
[[1]]

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)

Coefficients:
(Intercept)  Sepal.Width  
     2.6390       0.6905  


[[2]]

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)

Coefficients:
(Intercept)  Sepal.Width  
     3.5397       0.8651  


[[3]]

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)

Coefficients:
(Intercept)  Sepal.Width  
     3.9068       0.9015
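A list of models like the one printed above can be reduced to a plain vector with a typed map. The following is a minimal sketch, not the answer's own code: it uses base R's split() in place of group_by() and nest() so that it runs with {purrr} alone, and by_species, models and slopes are illustrative names.

```r
library(purrr)

# Base R split() stands in for group_by() + nest(); it yields a named list
by_species <- split(iris, iris$Species)

models <- map(by_species, ~ lm(Sepal.Length ~ Sepal.Width, data = .x))

# Pull one number per model; map_dbl() errors if a result is not a single double
slopes <- map_dbl(models, ~ coef(.x)[["Sepal.Width"]])
slopes
```

The slopes should match the Sepal.Width coefficients printed above, and the names carry over from the list returned by split().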

But the idea is to keep everything inside a data.frame, so we can use mutate() to create a column that holds this new list of lm results:

iris_n %>%
  mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)))
# A tibble: 3 x 3
  Species    data              lm      
  <fct>      <list>            <list>  
1 setosa     <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica  <tibble [50 × 4]> <S3: lm>

You can then chain several steps inside mutate() to get, for example, the r.squared:

iris_n %>%
  mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)), 
         lm = map(lm, summary), 
         r_squared = map_dbl(lm, "r.squared")) 
# A tibble: 3 x 4
  Species    data              lm               r_squared
  <fct>      <list>            <list>               <dbl>
1 setosa     <tibble [50 × 4]> <S3: summary.lm>     0.551
2 versicolor <tibble [50 × 4]> <S3: summary.lm>     0.277
3 virginica  <tibble [50 × 4]> <S3: summary.lm>     0.209

A more compact way is to use compose() from {purrr} to build a single function that does all of this in one go, instead of repeating the map() calls inside mutate().

get_rsquared <- compose(as_mapper("r.squared"), summary, lm)

iris_n %>%
  mutate(lm = map_dbl(data, ~ get_rsquared(Sepal.Length ~ Sepal.Width, data = .x)))
# A tibble: 3 x 3
  Species    data                 lm
  <fct>      <list>            <dbl>
1 setosa     <tibble [50 × 4]> 0.551
2 versicolor <tibble [50 × 4]> 0.277
3 virginica  <tibble [50 × 4]> 0.209

If you know you'll always be using Sepal.Length ~ Sepal.Width, you can even prefill lm() with partial():

pr_lm <- partial(lm, formula = Sepal.Length ~ Sepal.Width)
get_rsquared <- compose(as_mapper("r.squared"), summary, pr_lm)

iris_n %>%
  mutate(lm = map_dbl(data, get_rsquared))
# A tibble: 3 x 3
  Species    data                 lm
  <fct>      <list>            <dbl>
1 setosa     <tibble [50 × 4]> 0.551
2 versicolor <tibble [50 × 4]> 0.277
3 virginica  <tibble [50 × 4]> 0.209
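The order of the arguments to compose() matters, which is why lm comes last in the calls above. The toy sketch below is an addition to the answer that illustrates the ordering and the effect of partial() on simple functions; root_of_abs, root_of_abs_fwd and mean_no_na are illustrative names.

```r
library(purrr)

# compose() applies its functions right to left by default:
# abs() runs first, then sqrt()
root_of_abs <- compose(sqrt, abs)
root_of_abs(-16)

# .dir = "forward" lists the functions in the order they run
root_of_abs_fwd <- compose(abs, sqrt, .dir = "forward")
root_of_abs_fwd(-16)

# partial() fixes an argument ahead of time
mean_no_na <- partial(mean, na.rm = TRUE)
mean_no_na(c(1, NA, 3))
```

In get_rsquared, the same rule means lm() (or pr_lm()) runs first, then summary(), then the "r.squared" extractor.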

Regarding resources, I've written a series of blog posts on {purrr} that you can check: https://colinfay.me/tags/#purrr


一整个雨季

2021-01-02 17:35

Colin gives a great self-contained answer. Since you asked for more resources on using multiple models with tibbles, I'd also like to add the Many Models chapter of R for Data Science, which gives a broad overview of creating, simplifying, and modeling with list-columns. http://r4ds.had.co.nz/many-models.html
