dplyr summarise with a list of functions that depend on another data column

Question

Sorry for the terrible title, but it's hard to explain. I have the following data and functions I want to summarize the data with:

library(tidyverse)

# generate data
df <- map(1:4, ~ runif(100)) %>% 
  set_names(c(paste0('V', 1:3), 'threshold')) %>% 
  as_tibble() %>% 
  mutate(group = sample(c('a', 'b'), 100, replace = TRUE))

# generate function list
fun_factory_params <- 1:10
fun_factory <- function(param){
  function(v, threshold){
    sum((v * (threshold >= 1/2))^param)
  }
}
fun_list <- map(fun_factory_params, fun_factory)

df %>% head(n = 5)
      V1     V2     V3 threshold group
   <dbl>  <dbl>  <dbl>     <dbl> <chr>
1 0.631  0.0209 0.0360     0.713 b    
2 0.629  0.674  0.174      0.693 b    
3 0.144  0.358  0.439      0.395 a    
4 0.0695 0.760  0.657      0.810 a    
5 0.545  0.770  0.719      0.388 b

I want to group df by the group variable and summarize V1, V2 and V3 in the following way: for each V of those variables and each value n in fun_factory_params (1 to 10), I want to compute sum((V * (threshold >= 1/2))^n). To have results computed for each n in an elegant way, I created a function list fun_list through a function factory.
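To make the target quantity concrete, here is a small hand-checkable sketch in base R. The input vectors are made up for illustration, and force(param) is added as a defensive assumption so each closure keeps its own exponent; otherwise the factory matches the one above.

```r
# Same factory as above; force() pins the exponent inside each closure
fun_factory <- function(param) {
  force(param)
  function(v, threshold) {
    sum((v * (threshold >= 1/2))^param)
  }
}
fun_list <- lapply(1:10, fun_factory)

# Made-up toy input: the second element is masked because 0.4 < 1/2
v <- c(1, 2, 3)
threshold <- c(0.6, 0.4, 0.9)

res1 <- fun_list[[1]](v, threshold)  # 1 + 0 + 3 = 4
res2 <- fun_list[[2]](v, threshold)  # 1^2 + 0^2 + 3^2 = 10
print(c(res1, res2))
```

Each element of fun_list therefore needs two vectors of equal length: the column being summarised and the matching slice of threshold for the same group.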

I tried the following and got the error:

df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), fun_list, threshold = threshold)

Error in list2(...) : object 'threshold' not found

My issue comes from the threshold variable. I can't find a way to use the function list I built and tell R that the threshold argument has to be taken from each data group. I tried making threshold a parameter of the function factory and building the function list inside summarise_at through a purrr::map call, but I get the same issue. Essentially, every manipulation I try makes R leave the environment in which threshold would be evaluated by group. Using .$threshold returns the threshold variable for the entire data, so that does not work either.

However, the fact that the following code works (but only for a given value of n at a time) makes me think that there is a way to evaluate threshold correctly.

n <- 1
df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), ~ sum((. * (threshold >= 1/2))^n))

Any ideas?

Answer 1:

I found a way to have threshold evaluated in the right context (the grouped data) when it is passed as an additional argument to summarise_at: wrap threshold in quo().

df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), fun_list, threshold = quo(threshold))

I'm not 100% sure of my understanding. I think quo() captures threshold as an unevaluated expression (together with the environment in which quo() was called), so it is not looked up immediately. It is only evaluated later, inside the summary computation for each group, where the columns of the grouped data are visible (what we want). Without quoting, threshold = threshold is evaluated right away in the calling environment when the extra arguments are collected (hence the error from list2(...)), and no object named threshold exists there. General information about programming with dplyr can be found in the package documentation.
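The delayed evaluation of a quosure can be checked in isolation with rlang. This is only a sketch of the mechanism, not of what summarise_at does internally; the two small data frames stand in for two groups and are invented for the example.

```r
library(rlang)

# quo() stores the expression `threshold` without looking it up
q <- quo(threshold)

# Invented stand-ins for the per-group data
group_a <- data.frame(threshold = c(0.2, 0.8))
group_b <- data.frame(threshold = c(0.9, 0.7, 0.1))

# The lookup happens only here, against whichever data is supplied
val_a <- eval_tidy(q, data = group_a)
val_b <- eval_tidy(q, data = group_b)

print(val_a)  # 0.2 0.8
print(val_b)  # 0.9 0.7 0.1
```

The same quosure yields different values depending on the data it is evaluated against, which is the behaviour needed for a per-group threshold.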

Please let me know if this solution is wrong or not robust in some way.

Source: https://stackoverflow.com/questions/59185751/dplyr-summarise-with-list-of-function-and-dependence-on-other-data-column

Tags
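As a possible alternative that avoids passing threshold as an extra argument at all, the single-exponent call from the question can be wrapped in a helper and repeated for each exponent. This is a minimal sketch, assuming dplyr 1.0.0 or later (where across() supersedes summarise_at); summarise_power and its power argument are hypothetical names, and the data is regenerated with a fixed seed so the block runs on its own.

```r
library(dplyr)
library(purrr)

set.seed(42)  # assumption: fixed seed, only for reproducibility
df <- tibble(
  V1 = runif(100), V2 = runif(100), V3 = runif(100),
  threshold = runif(100),
  group = sample(c("a", "b"), 100, replace = TRUE)
)

# Hypothetical helper: grouped summary for a single exponent.
# Inside the lambda, `threshold` resolves to the current group's column
# and `power` to the helper's argument.
summarise_power <- function(data, power) {
  data %>%
    group_by(group) %>%
    summarise(
      across(c(V1, V2, V3), ~ sum((.x * (threshold >= 1/2))^power)),
      .groups = "drop"
    ) %>%
    mutate(n = power, .before = 1)
}

# One block of rows per exponent, stacked into a single tibble
result <- map(1:10, ~ summarise_power(df, .x)) %>% bind_rows()
print(result)
```

The result is in long form (one row per exponent and group) rather than one wide row per group, which may or may not suit the downstream use.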

group-by

dplyr

summarize