When to use map() and when to use summarise_at()/mutate_at()

时光取名叫无心 2021-01-02 16:54

Can anyone suggest when to use map() (and the other map_*() functions) and when to use summarise_at()/mutate_at()?

2 Answers
  • 2021-01-02 17:26

    The biggest difference between {dplyr} and {purrr} is that {dplyr} is designed to work on data.frames only, whereas {purrr} is designed to work on any kind of list. Since a data.frame is a list of columns, you can also use {purrr} to iterate over a data.frame.

    map_chr(iris, class)
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
       "numeric"    "numeric"    "numeric"    "numeric"     "factor" 
    

    summarise_at() and map_at() do not behave in exactly the same way: summarise_at() returns only the summary you asked for, whereas map_at() returns the whole data.frame as a list, with the modification applied only to the columns you specified:

    > library(purrr)
    > library(dplyr)
    > small_iris <- sample_n(iris, 5)
    > map_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
    $Sepal.Length
    [1] 6.58
    
    $Sepal.Width
    [1] 3.2
    
    $Petal.Length
    [1] 6.7 1.3 5.7 4.3 4.7
    
    $Petal.Width
    [1] 2.0 0.4 2.1 1.3 1.5
    
    $Species
    [1] virginica  setosa     virginica  versicolor versicolor
    Levels: setosa versicolor virginica
    
    > summarise_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
      Sepal.Length Sepal.Width
    1         6.58         3.2
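
    In dplyr 1.0.0 and later, the scoped verbs such as summarise_at() and mutate_at() are superseded by across(). As a minimal sketch of the equivalent calls, assuming dplyr >= 1.0.0 is installed and using a fixed, arbitrarily chosen five-row subset instead of sample_n() so the result is reproducible:

    ```r
    library(dplyr)

    # Fixed five-row subset of iris (row choice is arbitrary, for reproducibility)
    small_iris <- iris[c(1, 51, 101, 2, 52), ]

    # Scoped verb, as used above
    old_summary <- summarise_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)

    # across() equivalent: select the columns, then give the function
    new_summary <- summarise(small_iris, across(c(Sepal.Length, Sepal.Width), mean))

    # Both calls should give the same one-row data.frame
    all.equal(old_summary, new_summary)

    # The same pattern works inside mutate()
    scaled <- mutate(small_iris, across(c(Sepal.Length, Sepal.Width), ~ .x / 10))
    ```

    The scoped verbs still work, so this is a matter of style for new code rather than a required change.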
    

    map_at() always returns a list, whereas mutate_at() always returns a data.frame:

    > map_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
    $Sepal.Length
    [1] 0.77 0.54 0.67 0.64 0.67
    
    $Sepal.Width
    [1] 0.28 0.39 0.33 0.29 0.31
    
    $Petal.Length
    [1] 6.7 1.3 5.7 4.3 4.7
    
    $Petal.Width
    [1] 2.0 0.4 2.1 1.3 1.5
    
    $Species
    [1] virginica  setosa     virginica  versicolor versicolor
    Levels: setosa versicolor virginica
    
    > mutate_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
      Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    1         0.77        0.28          6.7         2.0  virginica
    2         0.54        0.39          1.3         0.4     setosa
    3         0.67        0.33          5.7         2.1  virginica
    4         0.64        0.29          4.3         1.3 versicolor
    5         0.67        0.31          4.7         1.5 versicolor
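
    If you want to stay in {purrr} but keep the data.frame, modify_at() is the type-preserving counterpart of map_at(). A small sketch, again on a fixed five-row subset chosen only for illustration:

    ```r
    library(purrr)

    # Fixed five-row subset of iris (row choice is arbitrary)
    small_iris <- iris[c(1, 51, 101, 2, 52), ]

    # map_at() returns a plain list of columns
    as_list <- map_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)

    # modify_at() returns an object of the same type as its input
    as_df <- modify_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)

    is.data.frame(as_list)
    is.data.frame(as_df)
    ```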
    

    So, to sum up on your first question: if you want to perform a "column-wise" operation on a non-nested data.frame and get a data.frame back, you should go for {dplyr}.

    Regarding nested columns, you have to combine group_by(), nest() from {tidyr}, mutate() and map(). What you are doing here is creating a smaller version of your data.frame that contains a list-column of data.frames. You then use map() to iterate over the elements of this new column.

    Here is an example with our beloved iris:

    library(tidyr)
    
    iris_n <- iris %>% 
      group_by(Species) %>% 
      nest()
    iris_n
    # A tibble: 3 x 2
      Species    data             
      <fct>      <list>           
    1 setosa     <tibble [50 × 4]>
    2 versicolor <tibble [50 × 4]>
    3 virginica  <tibble [50 × 4]>
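
    To see what the list-column actually holds, here is a short sketch that inspects one element and then reverses the nesting with unnest(). It assumes tidyr >= 1.0.0 for the cols argument:

    ```r
    library(dplyr)
    library(tidyr)

    iris_n <- iris %>%
      group_by(Species) %>%
      nest()

    # Each element of the list-column is itself a tibble (here: the setosa rows)
    first_group <- iris_n$data[[1]]
    dim(first_group)

    # unnest() expands the list-column back into ordinary rows
    iris_back <- iris_n %>%
      unnest(cols = data) %>%
      ungroup()
    dim(iris_back)
    ```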
    

    Here, the new object is a data.frame whose data column is a list of smaller data.frames, one per Species (the factor we specified in group_by()). We can then iterate over this column simply by doing:

    map(iris_n$data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x))
    [[1]]
    
    Call:
    lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
    
    Coefficients:
    (Intercept)  Sepal.Width  
         2.6390       0.6905  
    
    
    [[2]]
    
    Call:
    lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
    
    Coefficients:
    (Intercept)  Sepal.Width  
         3.5397       0.8651  
    
    
    [[3]]
    
    Call:
    lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
    
    Coefficients:
    (Intercept)  Sepal.Width  
         3.9068       0.9015  
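
    The printed list above is unnamed, so it is hard to tell which model belongs to which species. One possible way to handle that, sketched here with base names<-() and map_dbl(), is to name the list and pull out a single coefficient per model:

    ```r
    library(dplyr)
    library(tidyr)
    library(purrr)

    iris_n <- iris %>%
      group_by(Species) %>%
      nest()

    # One lm per nested tibble
    models <- map(iris_n$data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x))

    # Label each model with its species
    names(models) <- as.character(iris_n$Species)

    # map_dbl() keeps the names and returns a named numeric vector of slopes
    slopes <- map_dbl(models, ~ coef(.x)[["Sepal.Width"]])
    slopes
    ```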
    

    But the idea is to keep everything inside a data.frame, so we can use mutate() to create a column that holds this new list of lm results:

    iris_n %>%
      mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)))
    # A tibble: 3 x 3
      Species    data              lm      
      <fct>      <list>            <list>  
    1 setosa     <tibble [50 × 4]> <S3: lm>
    2 versicolor <tibble [50 × 4]> <S3: lm>
    3 virginica  <tibble [50 × 4]> <S3: lm>
    

    You can then chain several steps inside mutate() to get, for example, the r.squared:

    iris_n %>%
      mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)), 
             lm = map(lm, summary), 
             r_squared = map_dbl(lm, "r.squared")) 
    # A tibble: 3 x 4
      Species    data              lm               r_squared
      <fct>      <list>            <list>               <dbl>
    1 setosa     <tibble [50 × 4]> <S3: summary.lm>     0.551
    2 versicolor <tibble [50 × 4]> <S3: summary.lm>     0.277
    3 virginica  <tibble [50 × 4]> <S3: summary.lm>     0.209
    

    But a more compact way is to use compose() from {purrr} to build a single function that does all of these steps at once, instead of spelling out each step inside mutate().

    get_rsquared <- compose(as_mapper("r.squared"), summary, lm)
    
    iris_n %>%
      mutate(lm = map_dbl(data, ~ get_rsquared(Sepal.Length ~ Sepal.Width, data = .x)))
    # A tibble: 3 x 3
      Species    data                 lm
      <fct>      <list>            <dbl>
    1 setosa     <tibble [50 × 4]> 0.551
    2 versicolor <tibble [50 × 4]> 0.277
    3 virginica  <tibble [50 × 4]> 0.209
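
    Note that compose() applies its functions from right to left by default, which is why lm comes last in the call above but runs first. A toy sketch with two hypothetical helper functions (add_one and double are illustration names only):

    ```r
    library(purrr)

    # Hypothetical helpers, for illustration only
    add_one <- function(x) x + 1
    double  <- function(x) x * 2

    # Default direction: the rightmost function (add_one) runs first
    add_then_double <- compose(double, add_one)
    add_then_double(3)

    # With .dir = "forward" (purrr >= 0.3.0) the functions run in the order written
    add_then_double_fwd <- compose(add_one, double, .dir = "forward")
    add_then_double_fwd(3)
    ```

    Both composed functions compute (3 + 1) * 2; only the order in which the pieces are listed differs.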
    

    If you know you'll always be using Sepal.Length ~ Sepal.Width, you can even prefill lm() with partial():

    pr_lm <- partial(lm, formula = Sepal.Length ~ Sepal.Width)
    get_rsquared <- compose(as_mapper("r.squared"), summary, pr_lm)
    
    iris_n %>%
      mutate(lm = map_dbl(data, get_rsquared))
    # A tibble: 3 x 3
      Species    data                 lm
      <fct>      <list>            <dbl>
    1 setosa     <tibble [50 × 4]> 0.551
    2 versicolor <tibble [50 × 4]> 0.277
    3 virginica  <tibble [50 × 4]> 0.209
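
    For readers new to partial(): it returns a copy of a function with some arguments already filled in. A minimal sketch, where mean_na, fit and direct are names introduced only for this example:

    ```r
    library(purrr)

    # Pre-fill na.rm so it no longer has to be passed on every call
    mean_na <- partial(mean, na.rm = TRUE)
    mean_na(c(1, NA, 3))

    # The same idea with lm(): the formula is fixed, only the data varies
    pr_lm  <- partial(lm, formula = Sepal.Length ~ Sepal.Width)
    fit    <- pr_lm(data = iris)
    direct <- lm(Sepal.Length ~ Sepal.Width, data = iris)

    # The pre-filled and the direct call should estimate the same coefficients
    all.equal(coef(fit), coef(direct))
    ```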
    

    Regarding resources, I've written a series of blog posts on {purrr} that you can check: https://colinfay.me/tags/#purrr

  • 2021-01-02 17:35

    Colin gives a great self-contained answer. Since you asked for more resources on using multiple models with tibbles, I'd also like to add the Many Models chapter of R for Data Science, which gives a broad overview of creating, simplifying, and modeling with list-columns: http://r4ds.had.co.nz/many-models.html
