Question
Sorry for the terrible title, but it's hard to explain. I have the following data and functions I want to summarize the data with:
library(tidyverse)

# generate data
df <- map(1:4, ~ runif(100)) %>%
  set_names(c(paste0('V', 1:3), 'threshold')) %>%
  as_tibble() %>%
  mutate(group = sample(c('a', 'b'), 100, replace = TRUE))

# generate function list
fun_factory_params <- 1:10
fun_factory <- function(param){
  function(v, threshold){
    sum((v * (threshold >= 1/2))^param)
  }
}
fun_list <- map(fun_factory_params, fun_factory)
df %>% head(n = 5)
      V1     V2     V3 threshold group
   <dbl>  <dbl>  <dbl>     <dbl> <chr>
1 0.631  0.0209 0.0360     0.713 b
2 0.629  0.674  0.174      0.693 b
3 0.144  0.358  0.439      0.395 a
4 0.0695 0.760  0.657      0.810 a
5 0.545  0.770  0.719      0.388 b
I want to group df by the group variable and summarize V1, V2 and V3 in the following way: for each V of those variables and each value n in fun_factory_params (1 to 10), I want to compute sum((V * (threshold >= 1/2))^n). To have results computed for each n in an elegant way, I created a function list fun_list through a function factory.
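To make the role of the function factory concrete, here is a minimal sketch of what individual elements of fun_list compute on a tiny hand-made input. It uses base lapply in place of purrr::map so that it runs without extra packages, and the force(param) call is a defensive addition that is not in the original code. The vectors v and threshold are made-up illustration values.

```r
# Same factory as in the question, plus force() so that `param` is
# evaluated when the closure is created rather than lazily later on.
fun_factory <- function(param) {
  force(param)  # defensive addition, not part of the original post
  function(v, threshold) {
    sum((v * (threshold >= 1/2))^param)
  }
}

# Base R stand-in for purrr::map(fun_factory_params, fun_factory)
fun_list <- lapply(1:10, fun_factory)

# Made-up inputs: only the 2nd and 3rd entries pass the threshold test
v <- c(1, 2, 3)
threshold <- c(0.4, 0.6, 0.9)

fun_list[[1]](v, threshold)  # 0^1 + 2^1 + 3^1 = 5
fun_list[[2]](v, threshold)  # 0^2 + 2^2 + 3^2 = 13
```

Each element of the list is an ordinary two-argument function, which is why the threshold argument still has to be supplied at the time summarise_at calls it.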
I tried the following and got the error:
df %>%
  group_by(group) %>%
  summarise_at(vars(V1, V2, V3), fun_list, threshold = threshold)
Error in list2(...) : object 'threshold' not found
My issue comes from the threshold variable. I can't find a way to use the function list I built and tell R that the threshold argument has to be taken from each data group. I tried moving the threshold variable to a parameter of the function factory and building the function list inside summarise_at through a purrr::map call, but I get the same issue. Essentially, whatever I try, R ends up leaving the environment in which threshold would be evaluated by group. Using .$threshold returns the threshold variable for the entire data, so that does not work.
However, the fact that the following code works (but only for a given value of n at a time) makes me think that there is a way to evaluate threshold correctly.
n <- 1
df %>%
  group_by(group) %>%
  summarise_at(vars(V1, V2, V3), ~ sum((. * (threshold >= 1/2))^n))
Any ideas?
Answer 1:
I found a way to have threshold evaluated in the right environment (the grouped data) when it is passed as an additional argument to the summarise_at functions: you need to quote threshold with quo.
df %>%
  group_by(group) %>%
  summarise_at(vars(V1, V2, V3), fun_list, threshold = quo(threshold))
I'm not 100% sure of my understanding. I think that quoting makes sure that threshold is not evaluated right away but only later, inside the grouped data (what we want). Essentially, a quoted variable carries not only its name but also a reference to an environment, and the name is looked up in the current group's columns before falling back to that environment. Without quoting, threshold was evaluated immediately in the calling environment (the error is raised in list2(...)), where the variable does not exist. General information about programming in dplyr can be found here.
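The lookup behaviour can be checked in isolation with rlang, outside of dplyr. The following is a minimal sketch, assuming the rlang package is installed; group_a and group_b are hypothetical stand-ins for the rows of two groups and are not from the original post. It suggests that the quosure itself records the environment where quo() was called, and that the per-group values come from the data supplied at evaluation time.

```r
library(rlang)

# The quosure is created here, outside any data frame, so the environment
# it records is the calling environment, not the grouped data.
q <- quo(threshold)

# Hypothetical stand-ins for the rows belonging to two different groups
group_a <- data.frame(threshold = c(0.2, 0.7))
group_b <- data.frame(threshold = c(0.9, 0.6, 0.1))

# eval_tidy() looks `threshold` up in the supplied data first (the data
# mask), so the same quosure yields a different vector for each data frame.
eval_tidy(q, data = group_a)
eval_tidy(q, data = group_b)
```

This is only an illustration of tidy evaluation in general; how summarise_at splices the quosure into each function call is internal to dplyr and may differ between versions.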
Please let me know if this solution still has something wrong / not robust.
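A different route, which sidesteps passing threshold as an extra argument altogether, is to reshape the data to long format so that threshold stays an ordinary column, and then loop over the powers. This is a sketch of an alternative rather than the approach from the answer above; it assumes tidyr >= 1.0.0 for pivot_longer, and the names long, variable, v, power and value are illustrative choices that do not appear in the original post.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)  # only so that the sketch is reproducible
df <- tibble(
  V1 = runif(100), V2 = runif(100), V3 = runif(100),
  threshold = runif(100),
  group = sample(c("a", "b"), 100, replace = TRUE)
)

# One row per (observation, variable); threshold and group are repeated
long <- df %>%
  pivot_longer(c(V1, V2, V3), names_to = "variable", values_to = "v")

# For each power p, summarise by group and variable; `threshold` is a
# regular column here, so it is subset per group automatically.
res <- map(1:10, function(p) {
  long %>%
    group_by(group, variable) %>%
    summarise(value = sum((v * (threshold >= 1/2))^p)) %>%
    ungroup() %>%
    mutate(power = p)
}) %>%
  bind_rows()

res
```

The result is in long form, with one row per combination of group, variable and power, instead of one column per variable and function as summarise_at produces. If the wide layout is needed, tidyr::pivot_wider can reshape it afterwards.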
Source: https://stackoverflow.com/questions/59185751/dplyr-summarise-with-list-of-function-and-dependence-on-other-data-column