Question
The data set contains age, gender, state, income, and group. Group represents the group that each user belongs to:
group gender state age income
1 3 Female CA 33 $75,000 - $99,999
2 3 Male MA 41 $50,000 - $74,999
3 3 Male KY 32 $35,000 - $49,999
4 2 Female CA 23 $35,000 - $49,999
5 3 Male KY 25 $50,000 - $74,999
6 3 Male MA 21 $75,000 - $99,999
7 3 Female CA 33 $75,000 - $99,999
8 3 Male MA 41 $50,000 - $74,999
9 3 Male KY 32 $35,000 - $49,999
10 2 Female CA 23 $35,000 - $49,999
11 3 Male KY 25 $50,000 - $74,999
12 3 Female MA 21 $75,000 - $99,999
The above is dummy data; the goal is to get the concept right.
The goal is to group by group, gender, and income, get the count, and for each group compute the mean age of the users who belong to that group. The result should then be laid out in the following structure ("Expanded Version"):
group male female CA MA KY $35,000 - $49,999 $50,000 - $74,999 $75,000 - $99,999 mean_age
2 0 2 2 0 0 2 1 0 23
...
Here is my attempt, using dplyr:
> data %>% group_by(group,
+ gender,
+ state,
+ income) %>%
+ summarize(n()) %>%
+ mutate(mean_age = mean(age))
I was also exploring the spread() function.
Answer 1:
You can do both the count and the mean in one call to summarize():
library(dplyr)
data %>%
  group_by(group, gender, state, income) %>%
  summarize(count = n(), mean_age = mean(age))
For the wide data, the variable names in your sample won't uniquely identify what a given data point means, since the unique units are group x gender x state x income but the sample has only one row per group.
Since you have two summaries, the summary type is an additional layer of the unique identification. So to get everything in one row you would need variable names like [group]_[gender]_[state]_[income]_[summary], for example 2_Female_CA_$35,000 - $49,999_count.
There may be a better wide shape - what type of calculations are you doing on the wide data frame?
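One possible wide shape, if what you want is a separate count for each level of gender, state, and income within a group (rather than for each combination), is sketched below. This is only an illustration under assumptions: it needs tidyr 1.0.0 or later for pivot_longer() and pivot_wider(), it rebuilds the sample data inline, and the names mean_age_by_group, level_counts, and wide are made up for this example.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the question (stringsAsFactors = FALSE keeps text as character).
df <- data.frame(
  group  = c(3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  gender = c("Female", "Male", "Male", "Female", "Male", "Male",
             "Female", "Male", "Male", "Female", "Male", "Female"),
  state  = c("CA", "MA", "KY", "CA", "KY", "MA",
             "CA", "MA", "KY", "CA", "KY", "MA"),
  age    = c(33, 41, 32, 23, 25, 21, 33, 41, 32, 23, 25, 21),
  income = c("$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999",
             "$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999"),
  stringsAsFactors = FALSE
)

# Mean age per group: one row per group.
mean_age_by_group <- df %>%
  group_by(group) %>%
  summarize(mean_age = mean(age))

# Stack gender, state, and income into one "level" column,
# count each level within each group, then spread the levels into columns.
level_counts <- df %>%
  select(group, gender, state, income) %>%
  pivot_longer(-group, names_to = "variable", values_to = "level") %>%
  count(group, level) %>%
  pivot_wider(names_from = level, values_from = n,
              values_fill = list(n = 0L))  # levels absent from a group become 0

# One row per group: a count column per level, plus mean_age.
wide <- left_join(level_counts, mean_age_by_group, by = "group")
wide
```

This relies on the level labels being distinct across the three variables (no state code that is also an income label, for instance); if they could collide, count by variable and level together and build the column names from both.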
Answer 2:
In addition to @treysp's answer, you could use unite() and spread() to create a wide (and unwieldy) table. (I'm using as.data.frame() only to force printing of all columns.)
library(tidyverse)
df %>%
  group_by(group, gender, state, income) %>%
  summarize(n = n(), mean_age = mean(age)) %>%
  unite(key, gender, state, income) %>%
  spread(key, n) %>%
  as.data.frame()
# group mean_age Female_CA_$35,000 - $49,999 Female_CA_$75,000 - $99,999
#1 2 23 2 NA
#2 3 21 NA NA
#3 3 25 NA NA
#4 3 32 NA NA
#5 3 33 NA 2
#6 3 41 NA NA
# Female_MA_$75,000 - $99,999 Male_KY_$35,000 - $49,999
#1 NA NA
#2 1 NA
#3 NA NA
#4 NA 2
#5 NA NA
#6 NA NA
# Male_KY_$50,000 - $74,999 Male_MA_$50,000 - $74,999 Male_MA_$75,000 - $99,999
#1 NA NA NA
#2 NA NA 1
#3 2 NA NA
#4 NA NA NA
#5 NA NA NA
#6 NA 2 NA
#
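In the output above, group 3 is spread over several rows because mean_age stays behind as an identifying column when only n is spread. If one row per group is wanted, a possible variation is to widen both summaries at once with pivot_wider(). This is a sketch and not part of the original answer: it assumes tidyr 1.0.0 or later, rebuilds the sample data inline, and wide_all is a name made up for this example.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the question.
df <- data.frame(
  group  = c(3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  gender = c("Female", "Male", "Male", "Female", "Male", "Male",
             "Female", "Male", "Male", "Female", "Male", "Female"),
  state  = c("CA", "MA", "KY", "CA", "KY", "MA",
             "CA", "MA", "KY", "CA", "KY", "MA"),
  age    = c(33, 41, 32, 23, 25, 21, 33, 41, 32, 23, 25, 21),
  income = c("$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999",
             "$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999"),
  stringsAsFactors = FALSE
)

wide_all <- df %>%
  group_by(group, gender, state, income) %>%
  summarize(n = n(), mean_age = mean(age)) %>%
  ungroup() %>%                              # drop grouping before uniting the keys
  unite(key, gender, state, income) %>%      # e.g. "Female_CA_$35,000 - $49,999"
  pivot_wider(names_from = key,
              values_from = c(n, mean_age))  # columns named n_<key> and mean_age_<key>

as.data.frame(wide_all)
```

Combinations that do not occur in a group are left as NA, as in the spread() output.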
Sample data
df <- read.table(text =
"group gender state age income
1 3 Female CA 33 '$75,000 - $99,999'
2 3 Male MA 41 '$50,000 - $74,999'
3 3 Male KY 32 '$35,000 - $49,999'
4 2 Female CA 23 '$35,000 - $49,999'
5 3 Male KY 25 '$50,000 - $74,999'
6 3 Male MA 21 '$75,000 - $99,999'
7 3 Female CA 33 '$75,000 - $99,999'
8 3 Male MA 41 '$50,000 - $74,999'
9 3 Male KY 32 '$35,000 - $49,999'
10 2 Female CA 23 '$35,000 - $49,999'
11 3 Male KY 25 '$50,000 - $74,999'
12 3 Female MA 21 '$75,000 - $99,999'", header = T, row.names = 1)
Source: https://stackoverflow.com/questions/48755921/r-group-by-multiple-columns-and-mean-value-per-each-group-based-on-different-col