问题
This question is related to a previous one I asked, but trying to be more generic. I want to use formulas to perform operations on multiple "groups" of data (i.e. a_data1
, a_data2
, b_data1
, b_data2
, and then make operations using the *_data1
columns).
Based on @akrun's answer to that question, I created the following function. It takes a one-sided formula and applies it to all the "groups of data":
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
})
polymutate <- function(df, formula,
pattern = "(.)_(.*)",
staticCols = NULL) {
staticCols <- rlang::enquo(staticCols)
rhs <- rlang::f_rhs(formula)
names <- all.vars(rhs)
df %>%
mutate(
rn = row_number()
) %>%
pivot_longer(
cols = -c(rn, !!staticCols),
names_to = c(".value", "grp"),
names_pattern = pattern
) %>%
mutate(
new = eval(rhs)
) %>%
pivot_wider(
names_from = grp,
values_from = c(names, "new")
) %>%
select(
-rn
) %>%
rename_at(
vars(starts_with("new")),
gsub, pattern = "^new_", replacement = ""
)
}
df <- data.frame(a_data1 = 1:3, b_data1 = 2:4,
a_data2 = 3:5, b_data2 = 4:6,
static = 5:7)
polymutate(df, ~ a + b, staticCols = static)
#> # A tibble: 3 x 7
#> static a_data1 a_data2 b_data1 b_data2 data1 data2
#> <int> <int> <int> <int> <int> <int> <int>
#> 1 5 1 3 2 4 3 7
#> 2 6 2 4 3 5 5 9
#> 3 7 3 5 4 6 7 11
Created on 2020-03-13 by the reprex package (v0.3.0)
So, this polymutate
converts the dataframe into a longer format such that we have one column with the group name (data1
or data2
) and one per prefix (a
and b
). It then evaluates the given formula in the context of this deeper dataframe (obviously the names in the formula must match the prefixes). Once that's done, it widens the dataframe back to its original shape.
This works quite well, but it's a bit slow. Using it on a dataframe with 20,000 rows and 11 "groups" takes 0.77 seconds.
I figured that was due to the need to restructure such a large dataframe twice: deepening and then widening it.
So I wondered if I could do this without that hassle. I found the wrapr
package, which allows us to create aliases for names. I should therefore be able to perform something similar to the above, passing the formula and the names of the columns I want to change.
It could then extract the variables used in the formula and use them to rebuild the desired column names, create the alias mapping, and then use that mapping to apply the formula to the dataframe. I got quite close, but couldn't get the actual formula to be evaluated:
suppressPackageStartupMessages({
library(dplyr)
})
polymutate2 <- function(df, formula, name) {
vars <- all.vars(formula)
rhs <- rlang::f_rhs(formula)
aliases <- paste0(vars, "_", name)
mapping <- rlang::list2(!!!aliases)
names(mapping) <- vars
mapping <- do.call(wrapr::qc, mapping)
wrapr::let(
mapping,
df %>% mutate(!!name := a + b)
)
}
df <- data.frame(a_data1 = 1:3, b_data1 = 2:4,
a_data2 = 3:5, b_data2 = 4:6,
static = 5:7)
polymutate2(df, ~ a + b, "data1")
#> a_data1 b_data1 a_data2 b_data2 static data1
#> 1 1 2 3 4 5 3
#> 2 2 3 4 5 6 5
#> 3 3 4 5 6 7 7
Created on 2020-03-13 by the reprex package (v0.3.0)
You'll notice the mutate
call has a hard-coded expression, since I couldn't get it to work with the given formula. Replacing that expression with eval(rhs)
as in the previous version throws an object 'a' not found
error:
suppressPackageStartupMessages({
library(dplyr)
# library(tidyr)
})
polymutate2 <- function(df, formula, name) {
vars <- all.vars(formula)
rhs <- rlang::f_rhs(formula)
aliases <- paste0(vars, "_", name)
mapping <- rlang::list2(!!!aliases)
names(mapping) <- vars
mapping <- do.call(wrapr::qc, mapping)
wrapr::let(
mapping,
df %>% mutate(!!name := eval(rhs))
)
}
polymutate2(df, ~ a + b, "data1")
#> Error in eval(rhs): object 'a' not found
If I can get this to work (and assuming the solution doesn't dramatically harm performance), it's much faster: it only takes 0.03 seconds to run a chain of polymutate2
's (one for each of the 11 groups in my 20,000 row dataframe).
So, how can I get polymutate2
to work with any formula? I'm open to any sort of suggestion, no need to use wrapr
if some other solution exists. (I'm also concerned this solution might not work if the formula is complex, calling functions or whatnot, just haven't managed to check yet).
回答1:
Maybe someone more knowledgeable can chime in with a more tidyverse-y approach, but the problem can be solved (not very elegantly, admittedly) by wrapping the entire wrapr::let call into eval(parse(text=..))
- it is definitely faster:
suppressPackageStartupMessages({
invisible(lapply(c("dplyr", "tidyr", "rlang", "wrapr", "microbenchmark"),
require, character.only = TRUE))
})
polymutate <- function(df, formula,
pattern = "(.)_(.*)",
staticCols = NULL) {
staticCols <- rlang::enquo(staticCols)
rhs <- rlang::f_rhs(formula)
names <- all.vars(rhs)
df %>%
mutate(
rn = row_number()
) %>%
pivot_longer(
cols = -c(rn, !!staticCols),
names_to = c(".value", "grp"),
names_pattern = pattern
) %>%
mutate(
new = eval(rhs)
) %>%
pivot_wider(
names_from = grp,
values_from = c(names, "new")
) %>%
select(
-rn
) %>%
rename_at(
vars(starts_with("new")),
gsub, pattern = "^new_", replacement = ""
)
}
polymutate2 <- function(df, formula, name) {
vars <- all.vars(formula)
rhs <- deparse(rlang::f_rhs(formula))
aliases <- paste0(vars, "_", name)
mapping <- rlang::list2(!!!aliases)
names(mapping) <- vars
mapping <- do.call(wrapr::qc, mapping)
eval(parse(text=paste0("wrapr::let(mapping, df %>% mutate(!!name := ", rhs, "))" ))
)
}
set.seed(1)
df <- setNames(data.frame(matrix(sample(1:12, 7E6, replace=TRUE), ncol=7)),
c("a_data1", "b_data1", "a_data2", "b_data2", "a_data3", "b_data3", "static"))
pd <- polymutate(df, ~ a + b, staticCols = static)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(names)` instead of `names` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
pd2 <- polymutate2(df, ~ a + b, "data1") %>% polymutate2(., ~ a + b, "data2") %>% polymutate2(., ~ a + b, "data3") %>% dplyr::select(static, everything()) %>%
as_tibble()
all.equal(pd, pd2)
#> [1] TRUE
microbenchmark(polymutate(df, ~ a + b, staticCols = static),
polymutate2(df, ~ a + b, "data1") %>% polymutate2(., ~ a + b, "data2") %>% polymutate2(., ~ a + b, "data3") %>% dplyr::select(static, everything()) %>%
as_tibble(),
times=10L)
#> Unit: milliseconds
#> expr
#> polymutate(df, ~a + b, staticCols = static)
#> polymutate2(df, ~a + b, "data1") %>% polymutate2(., ~a + b, "data2") %>% polymutate2(., ~a + b, "data3") %>% dplyr::select(static, everything()) %>% as_tibble()
#> min lq mean median uq max neval cld
#> 1143.582663 1151.206750 1171.46502 1173.03649 1188.91108 1209.01984 10 b
#> 9.553352 9.619473 10.88463 10.59397 12.27675 12.52403 10 a
Created on 2020-03-14 by the reprex package (v0.3.0)
来源:https://stackoverflow.com/questions/60677042/using-formulas-with-aliases-to-perform-multi-column-operations