Succinct way to summarize different columns with different functions

后端 未结 4 1292
既然无缘
既然无缘 2021-01-13 15:21

My question builds on a similar one by imposing an additional constraint that the name of each variable should appear only once.

Consider a data frame



        
相关标签:
4条回答
  • 2021-01-13 15:48

    Here's a hacky function that uses unexported functions from dplyr so it is not future proof, but you can specify a different summary for each column.

    summarise_with <- function(.tbl, .funs) {
      funs <- enquo(.funs)
      syms <- syms(tbl_vars(.tbl))
      calls <- dplyr:::as_fun_list(.funs, funs, caller_env())
      stopifnot(length(syms)==length(calls))
      cols <- purrr::map2(calls, syms, ~dplyr:::expr_substitute(.x, quote(.), .y))
      cols <- purrr::set_names(cols, purrr::map_chr(syms, rlang::as_string))
      summarize(.tbl, !!!cols)
    }
    

    Then you could do

    df %>% summarise_with(list(mean, sum))
    

    and not have to type the column names at all.

    0 讨论(0)
  • 2021-01-13 15:50

    It seems like you can use map2 for this.

    map2_dfc( df[v], f, ~.y(.x))
    
    # # A tibble: 1 x 2
    #   potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
    #                                             <dbl>                        <int>
    # 1                                             5.5                          255
    
    0 讨论(0)
  • 2021-01-13 15:52

    I propose 2 tricks to solve this issue, see the code and some details for both solutions at the bottom :

    A function .at that returns results for for groups of variables (here only one variable by group) that we can then unsplice, so we benefit from both worlds, summarize and summarize_at :

    df %>% summarize(
      !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), mean),
      !!!.at(vars(another_annoyingly_long_name), sum))
    
    # # A tibble: 1 x 2
    #     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
    #                                               <dbl>                        <dbl>
    #   1                                             5.5                          255
    

    An adverb to summarize, with a dollar notation shorthand.

    df %>%
      ..flx$summarize(potentially_long_name_i_dont_want_to_type_twice = ~mean(.),
                      another_annoyingly_long_name = ~sum(.))
    
    # # A tibble: 1 x 2
    #     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
    #                                               <dbl>                        <int>
    #   1                                             5.5                          255
    

    code for .at

    It has to be used in a pipe because it uses the . in the parent environment, messy but it works.

    .at <- function(.vars, .funs, ...) {
      in_a_piped_fun <- exists(".",parent.frame()) &&
        length(ls(envir=parent.frame(), all.names = TRUE)) == 1
      if (!in_a_piped_fun)
        stop(".at() must be called as an argument to a piped function")
      .tbl <- try(eval.parent(quote(.)))
      dplyr:::manip_at(
        .tbl, .vars, .funs, rlang::enquo(.funs), rlang:::caller_env(),
        .include_group_vars = TRUE, ...)
    }
    

    I designed it to combine summarize and summarize_at :

    df %>% summarize(
      !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), list(foo=min, bar = max)),
      !!!.at(vars(another_annoyingly_long_name), median))
    
    # # A tibble: 1 x 3
    #       foo   bar another_annoyingly_long_name
    #     <dbl> <dbl>                        <dbl>
    #   1     1    10                         25.5
    

    code for ..flx

    ..flx outputs a function that replaces its formula arguments such as a = ~mean(.) by calls a = purrr::as_mapper(~mean(.))(a) before running. Convenient with summarize and mutate because a column cannot be a formula so there can't be any conflict.

    I like to use the dollar notation as a shorthand and to have names starting with .. so I can name those "tags" (and give them a class "tag") and see them as different objects (still experimenting with this). ..flx(summarize)(...) will work as well though.

    ..flx <- function(fun){
      function(...){
        mc <- match.call()
        mc[[1]] <- tail(mc[[1]],1)[[1]]
        mc[] <- imap(mc,~if(is.call(.) && identical(.[[1]],quote(`~`))) {
          rlang::expr(purrr::as_mapper(!!.)(!!sym(.y))) 
        } else .)
        eval.parent(mc)
      }
    }
    
    class(..flx) <- "tag"
    
    `$.tag` <- function(e1, e2){
      # change original call so x$y, which is `$.tag`(tag=x, data=y), becomes x(y)
      mc <- match.call()
      mc[[1]] <- mc[[2]]
      mc[[2]] <- NULL
      names(mc) <- NULL
      # evaluate it in parent env
      eval.parent(mc)
    }
    
    0 讨论(0)
  • 2021-01-13 16:03

    Use .[[i]] and !!names(.)[i]:= to refer to the ith column and its name.

    library(tibble)
    library(dplyr)
    library(rlang)
    
    df %>% summarize(!!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]])) 
    

    giving:

    # A tibble: 1 x 2
      potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
                                                <dbl>                        <int>
    1                                             5.5                          255
    

    Update

    If df were grouped (it is not in the question so this is not needed) then surround summarize with a do like this:

    library(dplyr)
    library(rlang)
    library(tibble)
    
    df2 <- tibble(a = 1:10, b = 11:20, g = rep(1:2, each = 5))
    
    df2 %>%
      group_by(g) %>%
      do(summarize(., !!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]]))) %>%
      ungroup
    

    giving:

    # A tibble: 2 x 3
          g     a     b
      <int> <dbl> <int>
    1     1     3    65
    2     2     8    90
    
    0 讨论(0)
提交回复
热议问题