Succinct way to summarize different columns with different functions

My question builds on a similar one by imposing an additional constraint that the name of each variable should appear only once.

Consider a data frame

4 Answers

醉梦人生

2021-01-13 15:48

Here's a hacky function that relies on unexported dplyr functions, so it is not future-proof, but it lets you specify a different summary for each column.

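library(dplyr)
library(rlang)  # for caller_env()

summarise_with <- function(.tbl, .funs) {
  funs <- enquo(.funs)
  # one symbol per column of .tbl
  syms <- syms(tbl_vars(.tbl))
  # unexported dplyr helper; may change or disappear between versions
  calls <- dplyr:::as_fun_list(.funs, funs, caller_env())
  # require exactly one function per column
  stopifnot(length(syms)==length(calls))
  # unexported dplyr helper: substitute each column symbol for `.` in its call
  cols <- purrr::map2(calls, syms, ~dplyr:::expr_substitute(.x, quote(.), .y))
  # name each expression after its column
  cols <- purrr::set_names(cols, purrr::map_chr(syms, rlang::as_string))
  summarize(.tbl, !!!cols)
}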

Then you could do

df %>% summarise_with(list(mean, sum))

and not have to type the column names at all.

独厮守ぢ

2021-01-13 15:50

It seems like you can use map2 for this.

map2_dfc( df[v], f, ~.y(.x))

# # A tibble: 1 x 2
#   potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                             <dbl>                        <int>
# 1                                             5.5                          255
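For a self-contained run of the map2 approach, here is a minimal sketch. The data frame and the objects v and f are not defined in this post, so the definitions below are assumptions, chosen only to be consistent with the printed output (mean 5.5, sum 255).

```r
library(tibble)
library(purrr)

# Assumed data: not shown in the post, picked to match the printed results
df <- tibble(
  potentially_long_name_i_dont_want_to_type_twice = 1:10,
  another_annoyingly_long_name = 21:30
)

v <- names(df)        # assumed: the columns to summarize
f <- list(mean, sum)  # assumed: one function per column, in the same order

# Walk columns and functions in parallel; .x is a column, .y is its function
res <- map2_dfc(df[v], f, ~ .y(.x))
res
```

Each column name is typed once, when the data frame is created; the pairing between column and function is purely positional, so v and f must have the same length and order.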

爱一瞬间的悲伤

2021-01-13 15:52

I propose two tricks to solve this issue; the code and some details for both solutions are at the bottom:

A function .at that returns results for groups of variables (here only one variable per group), which we can then splice in with !!!, so we get the best of both worlds, summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), mean),
  !!!.at(vars(another_annoyingly_long_name), sum))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <dbl>
#   1                                             5.5                          255

An adverb for summarize, with a dollar-notation shorthand:

df %>%
  ..flx$summarize(potentially_long_name_i_dont_want_to_type_twice = ~mean(.),
                  another_annoyingly_long_name = ~sum(.))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <int>
#   1                                             5.5                          255

code for .at

It has to be used in a pipe because it relies on the . in the parent environment. Messy, but it works.

.at <- function(.vars, .funs, ...) {
  in_a_piped_fun <- exists(".",parent.frame()) &&
    length(ls(envir=parent.frame(), all.names = TRUE)) == 1
  if (!in_a_piped_fun)
    stop(".at() must be called as an argument to a piped function")
  .tbl <- try(eval.parent(quote(.)))
  dplyr:::manip_at(
    .tbl, .vars, .funs, rlang::enquo(.funs), rlang:::caller_env(),
    .include_group_vars = TRUE, ...)
}

I designed it to combine summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), list(foo=min, bar = max)),
  !!!.at(vars(another_annoyingly_long_name), median))

# # A tibble: 1 x 3
#       foo   bar another_annoyingly_long_name
#     <dbl> <dbl>                        <dbl>
#   1     1    10                         25.5

code for ..flx

..flx outputs a function that, before running, replaces formula arguments such as a = ~mean(.) with calls such as a = purrr::as_mapper(~mean(.))(a). This is convenient with summarize and mutate because a column cannot be a formula, so there can't be any conflict.
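To make the rewriting step concrete, here is a small sketch of what a single formula argument turns into. The vector a and the names fml and rewritten are hypothetical, introduced only for illustration; the snippet mirrors the substitution described above rather than reproducing ..flx itself.

```r
library(purrr)
library(rlang)

a <- 1:10                  # hypothetical column values
fml <- quote(~ mean(.))    # unevaluated formula, as it would appear in a call

# as_mapper() turns a one-sided formula into a function of `.`
as_mapper(~ mean(.))(a)

# Build the replacement call for an argument named "a":
# purrr::as_mapper(~mean(.))(a)
rewritten <- expr(purrr::as_mapper(!!fml)(!!sym("a")))
rewritten

# Evaluating the rewritten call applies the function to `a`
result <- eval(rewritten)
result
```

Inside summarize or mutate the symbol a would resolve to the column of that name, which is why the argument name only has to be typed once.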

I like to use the dollar notation as a shorthand and to have names starting with .. so I can name those "tags" (and give them a class "tag") and see them as different objects (still experimenting with this). ..flx(summarize)(...) will work as well though.

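..flx <- function(fun){
  function(...){
    mc <- match.call()
    # keep only the last element of the calling expression,
    # e.g. `summarize` from `..flx$summarize` or `..flx(summarize)`
    mc[[1]] <- tail(mc[[1]],1)[[1]]
    # rewrite each formula argument `name = ~f(.)` as `name = purrr::as_mapper(~f(.))(name)`
    mc[] <- purrr::imap(mc,~if(is.call(.) && identical(.[[1]],quote(`~`))) {
      rlang::expr(purrr::as_mapper(!!.)(!!rlang::sym(.y)))
    } else .)
    # run the rewritten call in the caller's environment
    eval.parent(mc)
  }
}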

class(..flx) <- "tag"

`$.tag` <- function(e1, e2){
  # change original call so x$y, which is `$.tag`(e1 = x, e2 = y), becomes x(y)
  mc <- match.call()
  mc[[1]] <- mc[[2]]
  mc[[2]] <- NULL
  names(mc) <- NULL
  # evaluate it in parent env
  eval.parent(mc)
}

醉梦人生

2021-01-13 16:03

Use .[[i]] and !!names(.)[i]:= to refer to the ith column and its name.

library(tibble)
library(dplyr)
library(rlang)

df %>% summarize(!!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]]))

giving:

# A tibble: 1 x 2
  potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
                                            <dbl>                        <int>
1                                             5.5                          255

Update

If df were grouped (it is not in the question, so this is not needed), then wrap summarize in do like this:

library(dplyr)
library(rlang)
library(tibble)

df2 <- tibble(a = 1:10, b = 11:20, g = rep(1:2, each = 5))

df2 %>%
  group_by(g) %>%
  do(summarize(., !!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]]))) %>%
  ungroup

giving:

# A tibble: 2 x 3
      g     a     b
  <int> <dbl> <int>
1     1     3    65
2     2     8    90
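On grouped data, a possible alternative to wrapping the call in do() is dplyr's .data pronoun, which is evaluated within each group. This is a swapped-in variation rather than the approach shown above, and the helper name nm is hypothetical; treat it as a sketch under those assumptions.

```r
library(dplyr)
library(tibble)

df2 <- tibble(a = 1:10, b = 11:20, g = rep(1:2, each = 5))

# Hypothetical helper: capture the column names once, up front
nm <- names(df2)

res <- df2 %>%
  group_by(g) %>%
  summarize(
    !!nm[1] := mean(.data[[nm[1]]]),  # .data[[...]] looks up the column within the current group
    !!nm[2] := sum(.data[[nm[2]]])
  )
res
```

Because the names are read from df2 before the pipe starts, the positions in nm refer to the original column order, including the grouping column g in third place.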
