dplyr summarise with a list of functions that depend on another data column

Posted by 假装没事ソ on 2020-01-15 10:15:30

Question


Sorry for the terrible title, but it's hard to explain. I have the following data and functions I want to summarize the data with:

library(tidyverse)

# generate data
df <- map(1:4, ~ runif(100)) %>% 
  set_names(c(paste0('V', 1:3), 'threshold')) %>% 
  as_tibble() %>% 
  mutate(group = sample(c('a', 'b'), 100, replace = TRUE))

# generate function list
fun_factory_params <- 1:10
fun_factory <- function(param){
  function(v, threshold){
    sum((v * (threshold >= 1/2))^param)
  }
}
fun_list <- map(fun_factory_params, fun_factory)

df %>% head(n = 5)
      V1     V2     V3 threshold group
   <dbl>  <dbl>  <dbl>     <dbl> <chr>
1 0.631  0.0209 0.0360     0.713 b    
2 0.629  0.674  0.174      0.693 b    
3 0.144  0.358  0.439      0.395 a    
4 0.0695 0.760  0.657      0.810 a    
5 0.545  0.770  0.719      0.388 b    

I want to group df by the group variable and summarize V1, V2 and V3 in the following way: for each V of those variables and each value n in fun_factory_params (1 to 10), I want to compute sum((V * (threshold >= 1/2))^n). To have results computed for each n in an elegant way, I created a function list fun_list through a function factory.
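To make the target computation concrete, here is a minimal base R sketch of what a single function from the factory returns on a toy input. The vectors `v` and `threshold` below are made-up values for illustration, and the `force(param)` call is a defensive addition that is not in the original code.

```r
# Same factory as in the question, with force() added as a precaution so that
# each closure captures its own value of param.
fun_factory <- function(param) {
  force(param)
  function(v, threshold) {
    sum((v * (threshold >= 1/2))^param)
  }
}

f2 <- fun_factory(2)

# Hypothetical toy data: only the entries with threshold >= 1/2 contribute.
v         <- c(0.2, 0.5, 1.0)
threshold <- c(0.4, 0.6, 0.9)

result <- f2(v, threshold)  # 0^2 + 0.5^2 + 1^2
print(result)
```

The goal is to obtain this value for every combination of group, column (V1, V2, V3) and exponent.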

I tried the following and got the error:

df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), fun_list, threshold = threshold)

Error in list2(...) : object 'threshold' not found

My issue comes from the threshold variable. I can't find a way to use the function list I built and tell R that the threshold argument has to be taken from each data group. I tried moving the threshold variable to a parameter of the function factory and building the function list inside summarise_at through a purrr::map call, but I get the same issue. Essentially, whatever I try, threshold ends up being evaluated outside the environment where it would be resolved by group. Using .$threshold returns the threshold variable for the entire data, so that does not work either.

However, the fact that the following code works (but only for a given value of n at a time) makes me think that there is a way to evaluate threshold correctly.

n <- 1
df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), ~ sum((. * (threshold >= 1/2))^n))

Any ideas?


Answer 1:


I found a way to have threshold evaluated in the right environment (the grouped data) when it is passed as an additional argument to summarise_at: you need to quote threshold with quo.

df %>% 
  group_by(group) %>% 
  summarise_at(vars(V1,V2,V3), fun_list, threshold = quo(threshold))

I'm not 100% sure of my understanding. I think that quoting delays the evaluation of threshold: the quosure carries the expression together with the environment in which quo was called, and it is only evaluated later, inside summarise_at, against the grouped data (which is what we want). Without quoting, R tried to evaluate threshold right away, in an environment where the variable does not exist (apparently the calling environment rather than the data), hence the error. General information about programming with dplyr can be found in the "Programming with dplyr" vignette.
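One way to probe how a quosure behaves is to evaluate it by hand with rlang::eval_tidy against different data. This is only a sketch of the general mechanism, not of what summarise_at does internally. The two small data frames are hypothetical stand-ins for two groups.

```r
library(rlang)

# quo() captures the expression `threshold` and the environment it was written in;
# nothing is evaluated yet, so no `threshold` object needs to exist here.
q <- quo(threshold)

# Hypothetical stand-ins for the rows of two groups.
group_a <- data.frame(threshold = c(0.2, 0.8))
group_b <- data.frame(threshold = c(0.6, 0.9, 0.1))

# When evaluated with a data mask, columns of the data are looked up first.
res_a <- eval_tidy(q, data = group_a)
res_b <- eval_tidy(q, data = group_b)

print(res_a)
print(res_b)
```

The same quosure yields different values depending on the data it is evaluated against, which is consistent with threshold being resolved per group.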

Please let me know if this solution still has something wrong / not robust.
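As a possible alternative that avoids passing threshold through the extra arguments of summarise_at, the data can be reshaped to long format so that each function is called explicitly with both columns. This is a sketch, not the method used above. It assumes dplyr >= 1.1.0 (for reframe()) and tidyr >= 1.0.0 (for pivot_longer()). The seed and the output column names `variable`, `v`, `param` and `value` are arbitrary choices.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- tibble(
  V1 = runif(100), V2 = runif(100), V3 = runif(100),
  threshold = runif(100),
  group = sample(c("a", "b"), 100, replace = TRUE)
)

fun_factory_params <- 1:10
fun_factory <- function(param) {
  force(param)  # defensive: fix param for each closure
  function(v, threshold) sum((v * (threshold >= 1/2))^param)
}
fun_list <- map(fun_factory_params, fun_factory)

res <- df %>%
  # one row per (original row, variable); threshold and group are repeated
  pivot_longer(c(V1, V2, V3), names_to = "variable", values_to = "v") %>%
  group_by(group, variable) %>%
  # v and threshold refer to the current group's rows here
  reframe(
    param = fun_factory_params,
    value = map_dbl(fun_list, function(f) f(v, threshold))
  )

print(res)
```

The result is in long format, with one row per group, variable and exponent. tidyr::pivot_wider can bring it back to one row per group if a wide layout is preferred.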



Source: https://stackoverflow.com/questions/59185751/dplyr-summarise-with-list-of-function-and-dependence-on-other-data-column
