Using formulas with aliases to perform multi-column operations

问题

This question is related to a previous one I asked, but trying to be more generic. I want to use formulas to perform operations on multiple "groups" of data (i.e. a_data1, a_data2, b_data1, b_data2, and then make operations using the *_data1 columns).

Based on @akrun's answer to that question, I created the following function. It takes a one-sided formula and applies it to all the "groups of data":

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
})

polymutate <- function(df, formula,
                       pattern = "(.)_(.*)",
                       staticCols = NULL) {
  staticCols <- rlang::enquo(staticCols)

  rhs <- rlang::f_rhs(formula)
  names <- all.vars(rhs)
  df %>%
    mutate(
      rn = row_number()
    ) %>%
    pivot_longer(
      cols = -c(rn, !!staticCols),
      names_to = c(".value", "grp"),
      names_pattern = pattern
    ) %>%
    mutate(
      new = eval(rhs)
    ) %>%
    pivot_wider(
      names_from = grp,
      values_from = c(names, "new")
    ) %>%
    select(
      -rn
    ) %>%
    rename_at(
      vars(starts_with("new")),
      gsub, pattern = "^new_", replacement = ""
    )
}

df <- data.frame(a_data1 = 1:3, b_data1 = 2:4,
                 a_data2 = 3:5, b_data2 = 4:6,
                 static = 5:7)

polymutate(df, ~ a + b, staticCols = static)
#> # A tibble: 3 x 7
#>   static a_data1 a_data2 b_data1 b_data2 data1 data2
#>    <int>   <int>   <int>   <int>   <int> <int> <int>
#> 1      5       1       3       2       4     3     7
#> 2      6       2       4       3       5     5     9
#> 3      7       3       5       4       6     7    11

^{Created on 2020-03-13 by the reprex package (v0.3.0)}

So, this polymutate converts the dataframe into a longer format such that we have one column with the group name (data1 or data2) and one per prefix (a and b). It then evaluates the given formula in the context of this deeper dataframe (obviously the names in the formula must match the prefixes). Once that's done, it widens the dataframe back to its original shape.

This works quite well, but it's a bit slow. Using it on a dataframe with 20,000 rows and 11 "groups" takes 0.77 seconds.

I figured that was due to the need to restructure such a large dataframe twice: deepening and then widening it.

So I wondered if I could do this without that hassle. I found the wrapr package, which allows us to create aliases for names. I should therefore be able to perform something similar to the above, passing the formula and the names of the columns I want to change.

It could then extract the variables used in the formula and use them to rebuild the desired column names, create the alias mapping, and then use that mapping to apply the formula to the dataframe. I got quite close, but couldn't get the actual formula to be evaluated:

suppressPackageStartupMessages({
  library(dplyr)
})

polymutate2 <- function(df, formula, name) {
  vars <- all.vars(formula)
  rhs <- rlang::f_rhs(formula)
  aliases <- paste0(vars, "_", name)
  mapping <- rlang::list2(!!!aliases)
  names(mapping) <- vars

  mapping <- do.call(wrapr::qc, mapping)
  wrapr::let(
    mapping,
    df %>% mutate(!!name := a + b)
  )
}

df <- data.frame(a_data1 = 1:3, b_data1 = 2:4,
                 a_data2 = 3:5, b_data2 = 4:6,
                 static = 5:7)

polymutate2(df, ~ a + b, "data1")
#>   a_data1 b_data1 a_data2 b_data2 static data1
#> 1       1       2       3       4      5     3
#> 2       2       3       4       5      6     5
#> 3       3       4       5       6      7     7

^{Created on 2020-03-13 by the reprex package (v0.3.0)}

You'll notice the mutate call has a hard-coded expression, since I couldn't get it to work with the given formula. Replacing that expression with eval(rhs) as in the previous version throws an object 'a' not found error:

suppressPackageStartupMessages({
  library(dplyr)
  # library(tidyr)
})

polymutate2 <- function(df, formula, name) {
  vars <- all.vars(formula)
  rhs <- rlang::f_rhs(formula)
  aliases <- paste0(vars, "_", name)
  mapping <- rlang::list2(!!!aliases)
  names(mapping) <- vars

  mapping <- do.call(wrapr::qc, mapping)
  wrapr::let(
    mapping,
    df %>% mutate(!!name := eval(rhs))
  )
}

polymutate2(df, ~ a + b, "data1")
#> Error in eval(rhs): object 'a' not found

If I can get this to work (and assuming the solution doesn't dramatically harm performance), it's much faster: it only takes 0.03 seconds to run a chain of polymutate2's (one for each of the 11 groups in my 20,000 row dataframe).

So, how can I get polymutate2 to work with any formula? I'm open to any sort of suggestion, no need to use wrapr if some other solution exists. (I'm also concerned this solution might not work if the formula is complex, calling functions or whatnot, just haven't managed to check yet).

回答1:

Maybe someone more knowledgeable can chime in with a more tidyverse-y approach, but the problem can be solved (not very elegantly, admittedly) by wrapping the entire wrapr::let call into eval(parse(text=..)) - it is definitely faster:


suppressPackageStartupMessages({
    invisible(lapply(c("dplyr", "tidyr", "rlang", "wrapr", "microbenchmark"),
                     require, character.only = TRUE))
})

polymutate <- function(df, formula,
                       pattern = "(.)_(.*)",
                       staticCols = NULL) {
    staticCols <- rlang::enquo(staticCols)

    rhs <- rlang::f_rhs(formula)
    names <- all.vars(rhs)
    df %>%
        mutate(
            rn = row_number()
        ) %>%
        pivot_longer(
            cols = -c(rn, !!staticCols),
            names_to = c(".value", "grp"),
            names_pattern = pattern
        ) %>%
        mutate(
            new = eval(rhs)
        ) %>%
        pivot_wider(
            names_from = grp,
            values_from = c(names, "new")
        ) %>%
        select(
            -rn
        ) %>%
        rename_at(
            vars(starts_with("new")),
            gsub, pattern = "^new_", replacement = ""
        )
}

polymutate2 <- function(df, formula, name) {
    vars <- all.vars(formula)
    rhs <- deparse(rlang::f_rhs(formula))
    aliases <- paste0(vars, "_", name)
    mapping <- rlang::list2(!!!aliases)
    names(mapping) <- vars
    mapping <- do.call(wrapr::qc, mapping)
    eval(parse(text=paste0("wrapr::let(mapping, df %>% mutate(!!name := ", rhs, "))" ))
    )
}

set.seed(1)                 
df <- setNames(data.frame(matrix(sample(1:12, 7E6, replace=TRUE), ncol=7)),
               c("a_data1", "b_data1", "a_data2", "b_data2", "a_data3", "b_data3", "static"))

pd <- polymutate(df, ~ a + b, staticCols = static)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(names)` instead of `names` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
pd2 <- polymutate2(df, ~ a + b, "data1") %>% polymutate2(., ~ a + b, "data2") %>% polymutate2(., ~ a + b, "data3") %>% dplyr::select(static, everything()) %>% 
    as_tibble()

all.equal(pd, pd2)
#> [1] TRUE

microbenchmark(polymutate(df, ~ a + b, staticCols = static), 
               polymutate2(df, ~ a + b, "data1") %>% polymutate2(., ~ a + b, "data2") %>% polymutate2(., ~ a + b, "data3") %>% dplyr::select(static, everything()) %>% 
                   as_tibble(),
               times=10L)
#> Unit: milliseconds
#>                                                                                                                                                                        expr
#>                                                                                                                                 polymutate(df, ~a + b, staticCols = static)
#>  polymutate2(df, ~a + b, "data1") %>% polymutate2(., ~a + b, "data2") %>%      polymutate2(., ~a + b, "data3") %>% dplyr::select(static,      everything()) %>% as_tibble()
#>          min          lq       mean     median         uq        max neval cld
#>  1143.582663 1151.206750 1171.46502 1173.03649 1188.91108 1209.01984    10   b
#>     9.553352    9.619473   10.88463   10.59397   12.27675   12.52403    10  a

^{Created on 2020-03-14 by the reprex package (v0.3.0)}

来源：https://stackoverflow.com/questions/60677042/using-formulas-with-aliases-to-perform-multi-column-operations

标签

dataframe

dplyr

rlang

wrapr