Bunch recoding of variables in the tidyverse (functional / meta-programming)

I want to recode a bunch of variables with as few function calls as possible. I have one data.frame in which I want to recode a number of variables. I create a named list of all variable names and the recoding arguments I want to apply. Here I have no problem using map and dplyr. However, when it comes to recoding I find it much easier to use recode from the car package than dplyr's own recoding function. A side question is whether there is a nice way of doing the same thing with dplyr::recode.

As a next step I break the data.frame down into a nested tibble. Here I want to do specific recodings in each subset. This is where things get complicated, and I am no longer able to do this in a dplyr pipe. The only thing I can get working is a very ugly nested for loop.

Looking for ideas to do this in a nice and clean way.

Let's start with the easy example:

library(carData)
library(dplyr)
library(purrr)
library(tidyr)

# global recode list
recode_ls = list(

  mar = "'not married' = 0;
          'married' = 1",

  wexp = "'no' = 0;
          'yes' = 1"
)

recode_vars <- names(Rossi)[names(Rossi) %in% names(recode_ls)]

Rossi2 <- Rossi # let's save the results under a different name

Rossi2[, recode_vars] <- recode_vars %>% map(~ car::recode(Rossi[[.x]],
                                                           recode_ls[[.x]], # [[ ]] extracts the recode string itself
                                                           as.factor = FALSE,
                                                           as.numeric = TRUE))

So far this seems pretty clean to me, apart from the fact that car::recode is much easier to use than dplyr::recode.

Here comes my actual problem. What I am trying to do is recode (in this easy example) the variables mar and wexp differently in each tibble subset. In my real data set there are many more variables to recode in each subset, and they have different names too. Does anyone have a good idea how to do this cleanly using a dplyr pipe and map?

    nested_rossi <- as_tibble(Rossi) %>% nest(-race)

    recode_wexp_ls = list(

      no = list(

      mar = "'not married' = 0;
             'married' = 1",

      wexp = "'no' = 0;
              'yes' = 1"
      ),

      yes = list(
        mar = "'not married' = 1;
               'married' = 2",

        wexp = "'no' = 1;
                'yes' = 2"
      )
    )

We could also attach the list to the nested data.frame, but I'm not sure if this would make things more efficient.

nested_rossi$recode = list(

          no = list(

          mar = "'not married' = 0;
                 'married' = 1",

          wexp = "'no' = 0;
                  'yes' = 1"
          ),

          yes = list(
            mar = "'not married' = 1;
                   'married' = 2",

            wexp = "'no' = 1;
                    'yes' = 2"
          )
        )

Thanks for a cool question! This is a great chance to use all the power of metaprogramming.

First, let's examine the recode() function. It takes a vector and an arbitrary number of named arguments, and returns the vector with each matching value replaced by the value of the corresponding argument:

x <- c("a", "b", "c")
recode(x, a = "Z", c = "X")

#> [1] "Z" "b" "X"

The help page for recode() says that we can use unquote splicing (!!!) to pass a named list into it:

x_codes <- list(a = "Z", c = "X")
recode(x, !!!x_codes)

#> [1] "Z" "b" "X"
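As a small sketch of what the splice amounts to (the variable names spliced, called and with_default are made up for this illustration): splicing a named list gives the same result as building the call with base R's do.call(), and the spliced list can be combined with recode()'s .default argument to set a fallback for values that have no entry in the list.

```r
library(dplyr)

x <- c("a", "b", "c")
x_codes <- list(a = "Z", c = "X")

# Splicing the list is equivalent to writing recode(x, a = "Z", c = "X")
spliced <- recode(x, !!!x_codes)

# The same call assembled with base R, for comparison
called <- do.call(recode, c(list(x), x_codes))

# .default replaces every value that has no entry in the list
with_default <- recode(x, !!!x_codes, .default = "other")
with_default
#> [1] "Z"     "other" "X"
```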

This can be used when mutating a data frame. Suppose we have a subset of the Rossi dataset:

library(carData)
library(tidyverse)

rossi <- Rossi %>% 
  as_tibble() %>% 
  select(mar, wexp)

To mutate two variables in a single function call we can use this snippet (note that both the named-argument and the unquote-splicing approach work):

mar_codes <- list(`not married` = 0, married = 1)
wexp_codes <- list(no = 0, yes = 1)

rossi %>% 
  mutate(
    mar_code = recode(mar, "not married" = 0, "married" = 1),
    wexp_code = recode(wexp, !!!wexp_codes)
  )

#> # A tibble: 432 x 4
#>    mar         wexp  mar_code wexp_code
#>    <fct>       <fct>    <dbl>     <dbl>
#>  1 not married no           0         0
#>  2 not married no           0         0
#>  3 not married yes          0         1
#>  4 married     yes          1         1
#>  5 not married yes          0         1

So unquote splicing is a good way to pass multiple arguments to a function that uses non-standard evaluation.

Now suppose we have a list of lists of codes:

mapping <- list(mar = mar_codes, wexp = wexp_codes)
mapping

#> $mar
#> $mar$`not married`
#> [1] 0

#> $mar$married
#> [1] 1

#> $wexp
#> $wexp$no
#> [1] 0

#> $wexp$yes
#> [1] 1

What we need is to transform this list into a list of expressions to place inside mutate():

expressions <- mapping %>% 
  imap(
    ~ quo(
      recode(!!sym(.y), !!!.x)
    )
  )

expressions

#> $mar
#> <quosure>
#> expr: ^recode(mar, not married = 0, married = 1)
#> env:  0x7fbf374513c0

#> $wexp
#> <quosure>
#> expr: ^recode(wexp, no = 0, yes = 1)
#> env:  0x7fbf37453468
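To check what a single quosure of this kind does, it can be built by hand and evaluated against a data frame. This is only a sketch: the toy data frame and the name q are invented for the illustration, and rlang::eval_tidy() stands in for the evaluation that mutate() performs internally.

```r
library(dplyr)
library(rlang)

# Hypothetical toy data, standing in for the wexp column of Rossi
toy <- data.frame(wexp = c("no", "yes", "no"))
wexp_codes <- list(no = 0, yes = 1)

# !!sym("wexp") inserts the column name as a symbol;
# !!!wexp_codes splices the list in as named arguments
q <- quo(recode(!!sym("wexp"), !!!wexp_codes))

# The bare call stored in the quosure: recode(wexp, no = 0, yes = 1)
quo_get_expr(q)

# Evaluate the call with the columns of toy in scope
recoded <- eval_tidy(q, data = toy)
recoded
#> [1] 0 1 0
```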

The last step is to pass this list of expressions to mutate() and see what it does:

mutate(rossi, !!!expressions)

#> # A tibble: 432 x 2
#>      mar  wexp
#>    <dbl> <dbl>
#>  1     0     0
#>  2     0     0
#>  3     0     1
#>  4     1     1
#>  5     0     1

Now you can extend the lists to cover more variables, handle several lists at once, and so on.
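One possible variation, sketched here on a made-up two-row tibble: mutate() names its output columns after the names of the spliced list, so renaming the list before splicing adds new columns instead of overwriting the originals. The "_code" suffix and the toy data are assumptions chosen for this example.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data with the same value labels as Rossi
toy <- tibble(
  mar  = c("not married", "married"),
  wexp = c("no", "yes")
)

mapping <- list(
  mar  = list(`not married` = 0, married = 1),
  wexp = list(no = 0, yes = 1)
)

expressions <- imap(mapping, ~ quo(recode(!!sym(.y), !!!.x)))

# Rename the list so the results land in mar_code and wexp_code
names(expressions) <- paste0(names(expressions), "_code")

recoded <- mutate(toy, !!!expressions)
names(recoded)
#> [1] "mar"       "wexp"      "mar_code"  "wexp_code"
```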

Metaprogramming is a powerful technique, and I strongly recommend digging deeper into it. There is no better place to start than Hadley Wickham's Advanced R book.

I hope this is what you were looking for.

Update

Diving deeper. The question was: how to apply this technique to a tibble-column?

Let's create a nested tibble with the columns group and df (the data to recode):

rossi <- 
  head(Rossi, 5) %>% 
  as_tibble() %>% 
  select(mar, wexp)

nested <- tibble(group = c("yes", "no"), df = list(rossi))

nested looks like:

# A tibble: 2 x 2
  group df              
  <chr> <list>          
1 yes   <tibble [5 × 2]>
2 no    <tibble [5 × 2]>

We already know how to build a list of expressions from the list of codes. Let's create a function to handle it for us.

build_recode_expressions <- function(list_of_codes) {
  imap(list_of_codes, ~ quo(recode(!!sym(.y), !!!.x)))
}

Here, the list_of_codes argument is a named list with one entry (itself a named list of codes) for each variable to recode.

Assuming we have a list containing several sets of recoding codes, we can transform it into a list of lists of expressions. The number of variables in each set may be arbitrary.

codes <- list(
  yes = list(mar = list(`not married` = 0, married = 1)),
  no = list(
    mar = list(`not married` = 10, married = 20), 
    wexp = list(no = "NOOOO", yes = "YEEEES")
  )
)

exprs <- map(codes, build_recode_expressions)

Now we can easily add exprs to the nested data frame as a new list-column.

One more function will be useful for what follows. It takes a data frame and a list of quoted expressions and returns a new data frame with the recoded columns.

recode_df <- function(df, exprs) mutate(df, !!!exprs)
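Before combining the pieces, the two helpers can be tried on a single data frame. The sketch below repeats both definitions so that it runs on its own; the toy tibble, its age column and the name toy_exprs are invented for the illustration.

```r
library(dplyr)
library(purrr)

build_recode_expressions <- function(list_of_codes) {
  imap(list_of_codes, ~ quo(recode(!!sym(.y), !!!.x)))
}

recode_df <- function(df, exprs) mutate(df, !!!exprs)

# Hypothetical toy data; only wexp has codes, so mar and age stay as they are
toy <- tibble(
  mar  = c("married", "not married"),
  wexp = c("yes", "no"),
  age  = c(30, 41)
)

toy_exprs <- build_recode_expressions(list(wexp = list(no = 0, yes = 1)))
toy_recoded <- recode_df(toy, toy_exprs)

toy_recoded$wexp
#> [1] 1 0
```

Columns without an entry in the list of codes pass through mutate() unchanged, which is why each subset can carry its own set of variables to recode.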

It's time to put everything together. We have the tibble-column df, the list-column exprs, and the function recode_df, which combines one data frame with one list of expressions at a time.

The key is the map2() function, which lets us iterate over two lists in parallel:

nested %>% 
  mutate(exprs = exprs) %>% 
  mutate(df_recoded = map2(df, exprs, recode_df)) %>% 
  unnest(df, df_recoded)

And this is the output:

# A tibble: 10 x 5
   group mar         wexp   mar1 wexp1 
   <chr> <fct>       <fct> <dbl> <chr> 
 1 yes   not married no        0 no    
 2 yes   not married no        0 no    
 3 yes   not married yes       0 yes   
 4 yes   married     yes       1 yes   
 5 yes   not married yes       0 yes   
 6 no    not married no       10 NOOOO 
 7 no    not married no       10 NOOOO 
 8 no    not married yes      10 YEEEES
 9 no    married     yes      20 YEEEES
10 no    not married yes      10 YEEEES
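The unnest(df, df_recoded) call above uses the older tidyr interface. The following is a sketch of the same pipeline, assuming tidyr 1.0.0 or later, where the columns to unnest are passed as one selection and names_sep keeps the duplicated column names apart. Two details are variations introduced for this example rather than part of the pipeline above: the codes are looked up by group name (codes[group]) instead of relying on matching order, and a two-row toy tibble replaces the Rossi subset.

```r
library(dplyr)
library(purrr)
library(tidyr)

build_recode_expressions <- function(list_of_codes) {
  imap(list_of_codes, ~ quo(recode(!!sym(.y), !!!.x)))
}
recode_df <- function(df, exprs) mutate(df, !!!exprs)

# Hypothetical toy data with the same factor levels as Rossi
toy <- tibble(
  mar  = factor(c("not married", "married")),
  wexp = factor(c("no", "yes"))
)
nested <- tibble(group = c("yes", "no"), df = list(toy, toy))

codes <- list(
  yes = list(mar = list(`not married` = 0, married = 1)),
  no  = list(
    mar  = list(`not married` = 10, married = 20),
    wexp = list(no = "NOOOO", yes = "YEEEES")
  )
)

result <- nested %>%
  # look up each group's codes by name, then build its expressions
  mutate(exprs = map(codes[group], build_recode_expressions)) %>%
  mutate(df_recoded = map2(df, exprs, recode_df)) %>%
  select(-exprs) %>%
  # names_sep prefixes each inner column with its list-column name
  unnest(c(df, df_recoded), names_sep = "_")

names(result)
#> [1] "group"           "df_mar"          "df_wexp"         "df_recoded_mar"
#> [5] "df_recoded_wexp"
```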

I hope this update will solve your problem.

Source: https://stackoverflow.com/questions/56636417/bunch-recoding-of-variables-in-the-tidyverse-functional-meta-programing

Tags

tidyverse

purrr

recode