Relative frequencies / proportions with dplyr

灰色年华 2020-11-22 09:25

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative f

9 Answers
  • 2020-11-22 09:29

    I wrote a small function for this repeating task:

    library(dplyr)

    # Count rows per group, then add each group's share of the total in percent
    count_pct <- function(df) {
      df %>%
        tally() %>%
        mutate(n_pct = 100 * n / sum(n))
    }
    

    I can then use it like:

    mtcars %>% 
      group_by(cyl) %>% 
      count_pct
    

    It returns:

    # A tibble: 3 x 3
        cyl     n n_pct
      <dbl> <int> <dbl>
    1     4    11  34.4
    2     6     7  21.9
    3     8    14  43.8
    
  • 2020-11-22 09:32

    You can use the count() function, but its behaviour depends on the version of dplyr:

    • dplyr 0.7.1: returns an ungrouped table, so you need to group again by am

    • dplyr < 0.7.1: returns a grouped table, so there is no need to group again, although you might want to ungroup() for later manipulations

    dplyr 0.7.1

    mtcars %>%
      count(am, gear) %>%
      group_by(am) %>%
      mutate(freq = n / sum(n))
    

    dplyr < 0.7.1

    mtcars %>%
      count(am, gear) %>%
      mutate(freq = n / sum(n))
    

    This results in a grouped table. If you want to use it for further analysis, it may be useful to remove the grouping with ungroup().
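
    If you would rather not depend on which of the two behaviours your installed version has, one option is to state the grouping explicitly and drop it at the end. This is only a sketch: the name freq_by_am is made up for illustration, and it assumes that re-grouping an already grouped table by am is harmless, which is the case for the versions discussed above.

    ```r
    library(dplyr)

    # freq_by_am is a hypothetical name used only for this example
    freq_by_am <- mtcars %>%
      count(am, gear) %>%            # one row per (am, gear) combination
      group_by(am) %>%               # explicit grouping, whatever count() returned
      mutate(freq = n / sum(n)) %>%  # share of each gear within its am group
      ungroup()                      # plain table for later manipulations

    freq_by_am
    ```

    The freq values add up to 1 within each level of am, and the result carries no grouping.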

  • 2020-11-22 09:34

    Here is a general function implementing Henrik's solution on dplyr 0.7.1.

    freq_table <- function(x, 
                           group_var, 
                           prop_var) {
      group_var <- enquo(group_var)
      prop_var  <- enquo(prop_var)
      x %>% 
        group_by(!!group_var, !!prop_var) %>% 
        summarise(n = n()) %>% 
        mutate(freq = n /sum(n)) %>% 
        ungroup
    }
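
    For readers on a more recent setup, the same idea can probably be written with the embrace operator {{ }} instead of enquo() and !!. The sketch below assumes rlang 0.4.0 or later (where {{ }} was introduced) together with a dplyr version that supports it. The name freq_table2 is hypothetical and chosen only to avoid clashing with the function above.

    ```r
    library(dplyr)

    # freq_table2 is a hypothetical name; it mirrors freq_table() above
    freq_table2 <- function(x, group_var, prop_var) {
      x %>%
        count({{ group_var }}, {{ prop_var }}) %>%  # n per combination
        group_by({{ group_var }}) %>%               # proportions within group_var
        mutate(freq = n / sum(n)) %>%
        ungroup()
    }

    # Usage: proportion of each gear value within each level of am
    res <- freq_table2(mtcars, am, gear)
    res
    ```

    The column names are passed unquoted, as in the original function.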
    
  • 2020-11-22 09:36

    Here is a base R answer using aggregate and ave:

    df1 <- with(mtcars, aggregate(list(n = mpg), list(am = am, gear = gear), length))
    df1$prop <- with(df1, n/ave(n, am, FUN = sum))
    #Also with prop.table
    #df1$prop <- with(df1, ave(n, am, FUN = prop.table))
    df1
    
    #  am gear  n      prop
    #1  0    3 15 0.7894737
    #2  0    4  4 0.2105263
    #3  1    4  8 0.6153846
    #4  1    5  5 0.3846154 
    

    We can also use prop.table but the output displays differently.

    prop.table(table(mtcars$am, mtcars$gear), 1)
       
    #            3         4         5
    #  0 0.7894737 0.2105263 0.0000000
    #  1 0.0000000 0.6153846 0.3846154
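
    If the wide layout is inconvenient, the table can be turned into the same long format as the aggregate result. This is a sketch using only base R; the names tab, props and long are made up for the example, and dropping the zero cells is an assumption about what you want, since combinations that never occur are otherwise kept with a proportion of 0.

    ```r
    # Cross-tabulate with named dimensions so the columns are called am and gear
    tab   <- table(am = mtcars$am, gear = mtcars$gear)
    props <- prop.table(tab, margin = 1)   # margin = 1: proportions within each row (am)

    # Long format: one row per (am, gear) cell, proportion in a column named prop
    long <- as.data.frame(props, responseName = "prop")

    # Optional: drop combinations that never occur
    long <- long[long$prop > 0, ]
    long
    ```

    Note that am and gear come back as factors here, not as numbers.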
    
  • Despite the many answers, here is one more approach, which uses prop.table in combination with dplyr or data.table.

    library("dplyr")
    mtcars %>%
        group_by(am, gear) %>%
        summarise(n = n()) %>%
        mutate(freq = prop.table(n))
    
    library("data.table")
    cars_dt <- as.data.table(mtcars)
    cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n) , by = "am"]
    
  • 2020-11-22 09:43

    @Henrik's answer is better for usability, because this one turns the column into character rather than numeric, but it matches what you asked for...

    mtcars %>%
      group_by(am, gear) %>%
      summarise(n = n()) %>%
      mutate(rel.freq = paste0(round(100 * n / sum(n), 0), "%"))
    
    ##   am gear  n rel.freq
    ## 1  0    3 15      79%
    ## 2  0    4  4      21%
    ## 3  1    4  8      62%
    ## 4  1    5  5      38%
    

    EDIT Because Spacedman asked for it :-)

    as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
        class(x) <- c("rel_freq", class(x))
        attributes(x)[["rel_freq_col"]] <- rel_freq_col
        x
    }
    
    print.rel_freq <- function(x, ...) {
        freq_col <- attributes(x)[["rel_freq_col"]]
        x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")   
        class(x) <- class(x)[!class(x)%in% "rel_freq"]
        print(x)
    }
    
    mtcars %>%
      group_by(am, gear) %>%
      summarise(n = n()) %>%
      mutate(rel.freq = n / sum(n)) %>%
      as.rel_freq()
    
    ## Source: local data frame [4 x 4]
    ## Groups: am
    ## 
    ##   am gear  n rel.freq
    ## 1  0    3 15      79%
    ## 2  0    4  4      21%
    ## 3  1    4  8      62%
    ## 4  1    5  5      38%
    