Categorize numeric variable with mutate

猫巷女王i 2020-11-29 03:47

I would like to categorize a numeric variable in my data.frame object using dplyr (and have no idea how to do it).

Without

2 answers
  • 2020-11-29 04:38

    The ggplot2 package has 3 functions that work well for these tasks:

    • cut_number(): Makes n groups with (approximately) equal numbers of observations
    • cut_interval(): Makes n groups of equal range
    • cut_width(): Makes groups of a fixed width, set by the width argument

    My go-to is cut_number() because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.

    library(tidyverse)
    
    skewed_tbl <- tibble(
        counts = c(1:100, 1:50, 1:20, rep(1:10, 3), 
                   rep(1:5, 5), rep(1:2, 10), rep(1, 20))
        ) %>%
        mutate(
            counts_cut_number   = cut_number(counts, n = 4),
            counts_cut_interval = cut_interval(counts, n = 4),
            counts_cut_width    = cut_width(counts, width = 25)
            ) 
    
    # Data
    skewed_tbl
    #> # A tibble: 265 x 4
    #>    counts counts_cut_number counts_cut_interval counts_cut_width
    #>     <dbl> <fct>             <fct>               <fct>           
    #>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]    
    #>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]    
    #>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]    
    #>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]    
    #>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]    
    #>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]    
    #>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]    
    #>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]    
    #>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]    
    #> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]    
    #> # ... with 255 more rows
    
    summary(skewed_tbl$counts)
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #>    1.00    3.00   13.00   25.75   42.00  100.00
    
    # Histogram showing skew
    skewed_tbl %>%
        ggplot(aes(counts)) +
        geom_histogram(bins = 30)
    

    # cut_number() evenly distributes observations into bins by quantile
    skewed_tbl %>%
        ggplot(aes(counts_cut_number)) +
        geom_bar()
    

    # cut_interval() splits the range of the data into n intervals of equal width
    skewed_tbl %>%
        ggplot(aes(counts_cut_interval)) +
        geom_bar()
    

    # cut_width() creates bins that are each 25 wide (width = 25)
    skewed_tbl %>%
        ggplot(aes(counts_cut_width)) +
        geom_bar()
    

    Created on 2018-11-01 by the reprex package (v0.2.1)
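
    To compare the binning methods without plotting, the bin sizes can also be tabulated directly. This is a minimal sketch that rebuilds the same skewed counts as in the example above; the names bin_sizes, by_number and by_interval are made up for illustration.

    ```r
    library(dplyr)
    library(ggplot2)

    # Same skewed counts as in the example above
    counts <- c(1:100, 1:50, 1:20, rep(1:10, 3),
                rep(1:5, 5), rep(1:2, 10), rep(1, 20))

    bin_sizes <- tibble(counts = counts) %>%
        mutate(
            by_number   = cut_number(counts, n = 4),   # quantile-based bins
            by_interval = cut_interval(counts, n = 4)  # equal-range bins
        )

    # Number of observations that fall into each bin, per method
    bin_sizes %>% count(by_number)
    bin_sizes %>% count(by_interval)
    ```

    With skewed data like this, the cut_number() bins should come out roughly balanced, while most observations end up in the first cut_interval() bin.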

  • 2020-11-29 04:48
    library(dplyr)

    set.seed(123)
    df <- data.frame(a = rnorm(10), b = rnorm(10))

    # include.lowest = TRUE keeps the minimum of a in the first bin;
    # without it the smallest value falls outside the breaks and becomes NA
    df %>% mutate(a = cut(a, breaks = quantile(a, probs = seq(0, 1, 0.2)),
                          include.lowest = TRUE))


    giving:

                     a          b
    1  (-0.586,-0.316]  1.2240818
    2   (-0.316,0.094]  0.3598138
    3      (0.68,1.72]  0.4007715
    4   (-0.316,0.094]  0.1106827
    5     (0.094,0.68] -0.5558411
    6      (0.68,1.72]  1.7869131
    7     (0.094,0.68]  0.4978505
    8   [-1.27,-0.586] -1.9666172
    9   [-1.27,-0.586]  0.7013559
    10 (-0.586,-0.316] -0.4727914
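
    If readable category names are preferred over interval labels, cut() also accepts a labels argument. The following is a minimal sketch under the same setup; the column name a_group and the labels Q1 to Q5 are hypothetical choices, not part of the original answer. Note include.lowest = TRUE: the first interval is open on the left by default, so without it the smallest value of a is not assigned to any bin and shows up as NA.

    ```r
    library(dplyr)

    set.seed(123)
    df <- data.frame(a = rnorm(10), b = rnorm(10))

    df_labeled <- df %>%
        mutate(
            a_group = cut(
                a,
                breaks = quantile(a, probs = seq(0, 1, 0.2)),  # quintile breaks
                labels = paste0("Q", 1:5),   # hypothetical label names
                include.lowest = TRUE        # keep the minimum in the first bin
            )
        )

    # Tabulate the groups, listing NA as well if any value was left unbinned
    table(df_labeled$a_group, useNA = "ifany")
    ```

    Keeping the result in a new column rather than overwriting a leaves the original numeric values available for later steps.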
    