Relative frequencies / proportions with dplyr

灰色年华 2020-11-22 09:25

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative f

9 answers
  • 2020-11-22 09:44

    Try this:

    mtcars %>%
      group_by(am, gear) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n))
    
    #   am gear  n      freq
    # 1  0    3 15 0.7894737
    # 2  0    4  4 0.2105263
    # 3  1    4  8 0.6153846
    # 4  1    5  5 0.3846154
    

    From the dplyr vignette:

    When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.

    Thus, after the summarise, the last grouping variable specified in group_by, 'gear', is peeled off. In the mutate step, the data is grouped by the remaining grouping variable(s), here 'am'. You can check the grouping at each step with groups().

    The outcome of the peeling depends, of course, on the order of the grouping variables in the group_by call. You may wish to add a subsequent group_by(am) to make your code more explicit.
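
    As a minimal sketch of that more explicit style (the object names counts and freq_by_am are illustrative, not part of the original answer), the grouping left after summarise can be inspected with group_vars() and then restated before the mutate:

```r
library(dplyr)

# Count rows per (am, gear); summarise() peels off the last grouping level.
counts <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n())

# Inspect which grouping variables remain after the summarise.
group_vars(counts)

# Restate the grouping explicitly so the mutate does not rely on the peeling.
freq_by_am <- counts %>%
  group_by(am) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()

freq_by_am
```

    The trailing ungroup() is optional; it simply returns a plain tibble so that later steps do not inherit the grouping by accident.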

    For rounding and prettification, please refer to the nice answer by @Tyler Rinker.

  • 2020-11-22 09:50

    This answer is based upon Matifou's answer.

    First, I modified it with the scipen option so that the freq column is not displayed in scientific notation.

    Then I multiply the result by 100, so the freq column reads as a percentage rather than a decimal.

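    library(dplyr)

    getOption("scipen")   # inspect the current setting
    options(scipen = 10)  # penalise scientific notation when printing

    # count() returns an ungrouped result, so freq is the percentage of all rows
    mtcars %>%
      count(am, gear) %>%
      mutate(freq = (n / sum(n)) * 100)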
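
    Because count() returns an ungrouped result, sum(n) in the snippet above is the total number of rows. If percentages within each am group are wanted instead, one possible sketch is to regroup before the mutate (pct_within_am is an illustrative name, and rounding to one decimal is an arbitrary choice):

```r
library(dplyr)

pct_within_am <- mtcars %>%
  count(am, gear) %>%                           # ungrouped counts per (am, gear)
  group_by(am) %>%                              # regroup so sum(n) is per am
  mutate(freq = round(100 * n / sum(n), 1)) %>% # percentage within each am
  ungroup()

pct_within_am
```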
    
  • 2020-11-22 09:53

    For the sake of completeness on this popular question: since version 1.0.0 of dplyr, the .groups argument controls the grouping structure of the result of summarise after a group_by (see the summarise help page).

    With .groups = "drop_last", summarise drops the last level of grouping. This was the only behaviour available before version 1.0.0.

    library(dplyr)
    library(scales)
    
    original <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n()) %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    #> `summarise()` regrouping output by 'am' (override with `.groups` argument)
    
    original
    #> # A tibble: 4 x 4
    #> # Groups:   am [2]
    #>      am  gear     n rel.freq
    #>   <dbl> <dbl> <int> <chr>   
    #> 1     0     3    15 78.9%   
    #> 2     0     4     4 21.1%   
    #> 3     1     4     8 61.5%   
    #> 4     1     5     5 38.5%
    
    new_drop_last <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n(), .groups = "drop_last") %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    
    dplyr::all_equal(original, new_drop_last)
    #> [1] TRUE
    

    With .groups = "drop", all levels of grouping are dropped. The result is an ungrouped tibble with no trace of the previous group_by, so the proportions in the mutate step are computed over the whole table.

    # .groups = "drop"
    new_drop <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n(), .groups = "drop") %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    
    new_drop
    #> # A tibble: 4 x 4
    #>      am  gear     n rel.freq
    #>   <dbl> <dbl> <int> <chr>   
    #> 1     0     3    15 46.9%   
    #> 2     0     4     4 12.5%   
    #> 3     1     4     8 25.0%   
    #> 4     1     5     5 15.6%
    

    With .groups = "keep", the result has the same grouping structure as .data (here, mtcars grouped by am and gear). summarise does not peel off any variable used in the group_by, so each group contains a single row and every proportion is 100%.

    Finally, with .groups = "rowwise", each row is its own group. It gives the same result as "keep" in this situation.

    # .groups = "keep"
    new_keep <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n(), .groups = "keep") %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    
    new_keep
    #> # A tibble: 4 x 4
    #> # Groups:   am, gear [4]
    #>      am  gear     n rel.freq
    #>   <dbl> <dbl> <int> <chr>   
    #> 1     0     3    15 100.0%  
    #> 2     0     4     4 100.0%  
    #> 3     1     4     8 100.0%  
    #> 4     1     5     5 100.0%
    
    # .groups = "rowwise"
    new_rowwise <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n(), .groups = "rowwise") %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    
    dplyr::all_equal(new_keep, new_rowwise)
    #> [1] TRUE
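
    To see at a glance which grouping variables survive each option, a small sketch along these lines may help (grouped and remaining are illustrative names; "rowwise" is left out because it yields a row-wise tibble rather than an ordinary grouped one):

```r
library(dplyr)

grouped <- group_by(mtcars, am, gear)

# For each .groups option, list the grouping variables left after summarise().
remaining <- lapply(
  c(drop_last = "drop_last", drop = "drop", keep = "keep"),
  function(g) group_vars(summarise(grouped, n = n(), .groups = g))
)

remaining
```

    The variables that remain decide what sum(n) covers in a following mutate: one am level, the whole table, or a single row.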
    

    Another point that may be of interest: after applying group_by and summarise, a subtotal row per group can sometimes make the table easier to read.

    # create a subtotal line to help readability
    subtotal_am <- mtcars %>%
      group_by (am) %>% 
      summarise (n=n()) %>%
      mutate(gear = NA, rel.freq = 1)
    #> `summarise()` ungrouping output (override with `.groups` argument)
    
    mtcars %>% group_by (am, gear) %>%
      summarise (n=n()) %>% 
      mutate(rel.freq = n/sum(n)) %>%
      bind_rows(subtotal_am) %>%
      arrange(am, gear) %>%
      mutate(rel.freq =  scales::percent(rel.freq, accuracy = 0.1))
    #> `summarise()` regrouping output by 'am' (override with `.groups` argument)
    #> # A tibble: 6 x 4
    #> # Groups:   am [2]
    #>      am  gear     n rel.freq
    #>   <dbl> <dbl> <int> <chr>   
    #> 1     0     3    15 78.9%   
    #> 2     0     4     4 21.1%   
    #> 3     0    NA    19 100.0%  
    #> 4     1     4     8 61.5%   
    #> 5     1     5     5 38.5%   
    #> 6     1    NA    13 100.0%
    

    Created on 2020-11-09 by the reprex package (v0.3.0)

    Hope you find this answer useful.
