Plotting binned data using sum instead of count

春和景丽 2021-01-07 09:28

I've tried to search for an answer, but can't seem to find the right one that does the job for me.

I have a dataset (data) with two variables: people\

2 answers
  • 2021-01-07 10:06

    We can sum the awards with the aggregate function and then plot the result with the ggplot2 package. I don't make many barplots in base R these days, so I'm not sure of the best way to do it without ggplot2:

    Create sample data

    #data
    set.seed(123)
    dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
                      awards = rpois(200, 3))
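    head(dat)
    # Output from the original post. R >= 3.6.0 changed the default
    # sample() algorithm, so the values may differ on a current install.
    #>   age awards
    #> 1  28      2
    #> 2  44      6
    #> 3  32      3
    #> 4  47      3
    #> 5  49      2
    #> 6  21      5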
    

    By age

    #aggregate
    
    sum_by_age <- aggregate(awards ~ age, data = dat, FUN = sum)
    
    library(ggplot2)
    
    ggplot(sum_by_age, aes(x = age, y = awards))+
        geom_bar(stat = 'identity')
    

    By age group

    #create groups (labels match the <= cut-offs, so the ranges do not overlap)
    
    dat$age_group <- ifelse(dat$age <= 30, '20-30',
                            ifelse(dat$age <= 40, '31-40',
                                   '41+'))
    
    sum_by_age_group <- aggregate(awards ~ age_group, data = dat, FUN = sum)
    
    ggplot(sum_by_age_group, aes(x = age_group, y = awards))+
        geom_bar(stat = 'identity')
    

    Note

    We could skip the aggregate step altogether and just use:

    ggplot(dat, aes(x = age, y = awards)) + geom_bar(stat = 'identity')
    

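    This works because the bars for rows that share an age are stacked, so the total height of each bar is the sum. I prefer not to do it that way, though, because an intermediate data step can be useful elsewhere in your analytical pipeline, for comparisons other than visualizing.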
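
    As a rough illustration of why the intermediate step can be handy, the sketch below reuses the aggregated data to derive the mean number of awards per person at each age. The names count_by_age, summary_by_age, people and awards_per_person are made up for this example and are not part of the original answer.

    ```r
    # Recreate the sample data from above
    set.seed(123)
    dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
                      awards = rpois(200, 3))

    # Total awards per age, and number of people (rows) per age
    sum_by_age   <- aggregate(awards ~ age, data = dat, FUN = sum)
    count_by_age <- aggregate(awards ~ age, data = dat, FUN = length)
    names(count_by_age)[2] <- "people"

    # Combine both summaries and derive the mean awards per person
    summary_by_age <- merge(sum_by_age, count_by_age, by = "age")
    summary_by_age$awards_per_person <-
        summary_by_age$awards / summary_by_age$people

    head(summary_by_age)
    ```

    The same summary_by_age data frame could then feed a plot, a table, or a further comparison without recomputing anything.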

  • 2021-01-07 10:18

    For completeness, I am adding the base R solution to @bouncyball's great answer. I will use their synthetic data, but I will use cut to create the age groups before aggregation.

    # Create data for plotting
    > set.seed(123)
    > dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
                        awards = rpois(200, 3))
    
    # Create a new column containing the age groups
    > dat[["ageGroups"]] <- cut(dat[["age"]], c(-Inf, 20, 30, 40, Inf),
                                right = FALSE)
    

    cut divides numeric data into groups at the breaks given in its second argument. right = FALSE makes each interval closed on the left and open on the right, so a group includes its lower break rather than its upper one (i.e. 20 <= x < 30 rather than the default 20 < x <= 30). The groups do not have to be equally spaced. If you do not want to include data above or below a certain value, remove the Inf from the end or the -Inf from the beginning of the breaks, respectively; cut then returns <NA> for those values. To give your groups names, use the labels argument.
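
    To make these options concrete, here is a small sketch on a handful of hand-picked ages. The vector ages and the group labels are assumptions chosen only for illustration.

    ```r
    ages <- c(19, 20, 29, 30, 40, 55)
    breaks <- c(-Inf, 20, 30, 40, Inf)

    # Default (right = TRUE): each interval includes its upper break
    default_groups <- cut(ages, breaks)

    # right = FALSE: each interval includes its lower break instead
    left_closed <- cut(ages, breaks, right = FALSE)

    # Without -Inf/Inf, values outside the breaks become NA
    # (40 is NA here too, because [30,40) excludes its upper break)
    bounded <- cut(ages, c(20, 30, 40), right = FALSE)

    # labels gives the groups readable names, one per interval
    named <- cut(ages, breaks, right = FALSE,
                 labels = c("under 20", "20-29", "30-39", "40+"))

    data.frame(ages, default_groups, left_closed, bounded, named)
    ```

    Printing the data frame side by side shows where the boundary values 20, 30 and 40 land under each setting.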

    Now we can aggregate based on the groups we created.

    > (summedGroups <- aggregate(awards ~ ageGroups, dat, FUN = sum))
      ageGroups awards
    1   [20,30)    188
    2   [30,40)    212
    3 [40, Inf)    194
    

    Finally, we can plot these data using the barplot function. The key is to pass the age groups as bar labels through the names.arg argument.

    > barplot(summedGroups[["awards"]], names.arg = summedGroups[["ageGroups"]])
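
    As a possible variation on the steps above, tapply returns a named vector, and barplot takes its bar labels from those names, so names.arg is not needed. This is only a sketch: the group labels, axis titles, and the variable names sums and mids are assumptions for the example.

    ```r
    set.seed(123)
    dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
                      awards = rpois(200, 3))

    # All ages are >= 20 here, so the lowest break can be 20 instead of -Inf
    dat$ageGroups <- cut(dat$age, c(20, 30, 40, Inf), right = FALSE,
                         labels = c("20-29", "30-39", "40+"))

    # One sum of awards per age group, named by group
    sums <- tapply(dat$awards, dat$ageGroups, sum)

    # barplot() labels the bars with names(sums) and returns the bar midpoints
    mids <- barplot(sums, xlab = "Age group", ylab = "Total awards")
    ```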
    
