I have a dataset for which I want to summarise by mean, but also calculate the max to just 1 of the variables.
Let me start with an example of what I would like to a
Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.
Solution one
My own take:
mapply(summarise_at,
.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst(mean, max),
MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5)))
%>% reduce(merge, by = "Species")
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
# 1 setosa 5.314 3.714 1.509 0.2773 0.5
# 2 versicolor 5.998 2.804 4.317 1.3468 1.8
# 3 virginica 6.622 2.984 5.573 2.0327 2.5
Solution two
An elegant solution using package purrr
from the tidyverse itself, inspired by this discussion:
list(.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst("mean" = mean, "max" = max)) %>%
pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y))
%>% reduce(inner_join, by = "Species")
+ + + # A tibble: 3 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
Short discussion
The data.frame and tibble are the desired result, the last column being the max
of petal.width
and the other ones the means (by group and filter) of all other columns.
Both solutions hinge on three realizations:
summarise_at
accepts as arguments two lists, one of n variables and one of m functions, and applies all m functions to all n variables, therefore producing m X n vectors in a tibble. The solution might thus imply forcing this function to loop in some way across "couples" formed by all variables to which we want one specific function to be applied and the one function, then another group of variables and their own function, and so on!mapply
or the family of functions map2
, pmap
and variations thereof from dplyr
's tidyverse fellow purrr
. Both accept two lists of l elements and perform a given operation on corresponding elements (matched by position) of the two lists. reduce
with inner_join
or just merge
.Note that the means I obtain are different from those of the OP, but they are the means I obtain with his reproducible example as well (maybe we have two different versions of the iris
dataset?).
If you wanted to do something more complex like that, you could write your own version of summarize_at
. With this version you supply triplets of column names, functions, and naming rules. For example
Here's a rough start
my_summarise_at<-function (.tbl, ...)
{
dots <- list(...)
stopifnot(length(dots)%%3==0)
vars <- do.call("append", Map(function(.cols, .funs, .name) {
cols <- select_colwise_names(.tbl, .cols)
funs <- as.fun_list(.funs, .env = parent.frame())
val<-colwise_(.tbl, funs, cols)
names <- sapply(names(val), function(x) gsub("%", x, .name))
setNames(val, names)
}, dots[seq_along(dots)%%3==1], dots[seq_along(dots)%%3==2], dots[seq_along(dots)%%3==0]))
summarise_(.tbl, .dots = vars)
}
environment(my_summarise_at)<-getNamespace("dplyr")
And you can call it with
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean",
"Petal.Width", max, "%_max")
For the names we just replace the "%" with the default name. The idea is just to dynamically build the summarize_
expression. The summarize_at
function is really just a convenience wrapper around that basic function.
I was looking for something similar and tried the following. It works well and much easier to read than the suggested solutions.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise(MeanSepalLength=mean(Sepal.Length),
MeanSepalWidth = mean(Sepal.Width),
MeanPetalLength=mean(Petal.Length),
MeanPetalWidth=mean(Petal.Width),
MaxPetalWidth=max(Petal.Width))
# A tibble: 3 x 6
Species MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246 0.6
2 versicolor 5.94 2.77 4.26 1.33 1.8
3 virginica 6.59 2.97 5.55 2.03 2.5
In summarise() part, define your column name and give your column to summarise inside your function of choice.
If you are trying to do everything with dplyr (which might be easier to remember), then you can leverage the new across
function which will be available from dplyr 1.0.0.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Sepal.Length:Petal.Width, mean)) %>%
cbind(iris %>%
group_by(Species) %>%
summarize(across(Petal.Width, max)) %>%
select(-Species)
)
It shows that the only difficulty is to combine two calculations on the same column Petal.Width
on a grouped variable - you have to do the grouping again but can nest it into the cbind
.
This returns correctly the result:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
1 setosa 5.313636 3.713636 1.509091 0.2772727 0.6
2 versicolor 5.997872 2.804255 4.317021 1.3468085 1.8
3 virginica 6.622449 2.983673 5.573469 2.0326531 2.5
If the task would not specify two calculations but only one on the same column Petal.Width
, then this could be elegantly written as:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(
across(Sepal.Length:Petal.Length, mean),
across(Petal.Width, max)
)