问题
I'm creating a bunch of basic status reports and one of things I'm finding tedious is adding a total row to all my tables. I'm currently using the Tidyverse approach and this is an example of my current code. What I'm looking for is an option to have a few different levels included by default.
#load into RStudio viewer (not required)
iris = iris
#summary at the group level
summary_grouped = iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
#summary at the overall level
summary_overall = iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall")
#append results for report
summary_table = rbind(summary_grouped, summary_overall)
Doing this multiple times over is very tedious. I kind of want:
summary_overall = iris %>%
group_by(Species, total = TRUE) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
FYI - if you're familiar with SAS I'm looking for the same type of functionality available via a class, ways or types statements in proc means that let me control the level of summarization and get multiple levels in one call.
Any help is appreciated. I know I can create my own function, but was hoping there is something that already exists. I would also prefer to stick with the tidyverse style of programming though I'm not set on that.
回答1:
Another alternative:
library(tidyverse)
iris %>%
mutate_at("Species", as.character) %>%
list(group_by(.,Species), .) %>%
map(~summarize(.,mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))) %>%
bind_rows() %>%
replace_na(list(Species="Overall"))
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
回答2:
library(dplyr)
iris %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
ungroup() %>%
mutate_at(vars(Species), as.character) %>%
{rbind(.,c("Overal",mean(.$mean_s_length),max(.$max_s_width)))} %>%
mutate_at(vars(-Species), as.double) %>%
mutate_at(vars(Species), as.factor)
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overal 5.84 4.4
Created on 2019-06-21 by the reprex package (v0.3.0)
回答3:
One way, also tedious but in one longer pipe, is to put the second summarise instructions in bind_rows
.
The as.character
call avoids a warning:
Warning messages:
1: In bind_rows_(x, .id) :
binding factor and character vector, coercing into character vector
2: In bind_rows_(x, .id) :
binding character and factor vector, coercing into character vector
library(tidyverse)
summary_grouped <- iris %>%
mutate(Species = as.character(Species)) %>%
group_by(Species) %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
bind_rows(iris %>%
summarize(mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)) %>%
mutate(Species = "Overall"))
## A tibble: 4 x 3
# Species mean_s_length max_s_width
# <chr> <dbl> <dbl>
#1 setosa 5.01 4.4
#2 versicolor 5.94 3.4
#3 virginica 6.59 3.8
#4 Overall 5.84 4.4
回答4:
Maybe something like this:
As you want to perform different operations on the same input (iris
), best to map
over the different summary functions and apply to the data.
map_dfr
combines the list outputs using bind_rows
library(dplyr)
library(purrr)
pipe <- . %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width))
map_dfr(
list(pipe, . %>% mutate(Species = "Overall") %>% pipe),
exec,
iris)
#> Warning in bind_rows_(x, .id): binding factor and character vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
回答5:
Solution where you need to apply wanted function only once on a double dataset:
library(tidyverse)
iris %>%
rbind(mutate(., Species = "Overall")) %>%
group_by(Species) %>%
summarize(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# A tibble: 4 x 3
Species mean_s_length max_s_width
<chr> <dbl> <dbl>
1 Overall 5.84 4.4
2 setosa 5.01 4.4
3 versicolor 5.94 3.4
4 virginica 6.59 3.8
Trick is to pass original dataset with a new group ID (ie Species
): mutate(iris, Species = "Overall")
回答6:
You can write a function which does the same summarize
on an ungroup
ed tibble and rbinds that to the end.
summarize2 <- function(df, ...){
bind_rows(summarise(df, ...), summarize(ungroup(df), ...))
}
iris %>%
group_by(Species) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # A tibble: 4 x 3
# Species mean_s_length max_s_width
# <fct> <dbl> <dbl>
# 1 setosa 5.01 4.4
# 2 versicolor 5.94 3.4
# 3 virginica 6.59 3.8
# 4 NA 5.84 4.4
You could add some logic for what the "Overall" groups should be named if you want
summarize2 <- function(df, ...){
s1 <- summarise(df, ...)
s2 <- summarize(ungroup(df), ...)
for(v in group_vars(s1)){
if(is.factor(s1[[v]]))
s1[[v]] <- as.character(s1[[v]])
if(is.character(s1[[v]]))
s2[[v]] <- 'Overall'
else if(is.numeric(s1[[v]]))
s2[[v]] <- -Inf
}
bind_rows(s1, s2)
}
iris %>%
group_by(Species, g = Petal.Length %/% 1) %>%
summarize2(
mean_s_length = mean(Sepal.Length),
max_s_width = max(Sepal.Width)
)
# # Groups: Species [4]
# Species g mean_s_length max_s_width
# <chr> <dbl> <dbl> <dbl>
# 1 setosa 1 5.01 4.4
# 2 versicolor 3 5.35 2.9
# 3 versicolor 4 6.09 3.4
# 4 versicolor 5 6.35 3
# 5 virginica 4 5.85 3
# 6 virginica 5 6.44 3.4
# 7 virginica 6 7.43 3.8
# 8 Overall -Inf 5.84 4.4
来源:https://stackoverflow.com/questions/56709648/summarize-data-at-different-aggregate-levels-r-and-tidyverse