R: Create Indicator Columns from list of conditions

Question

I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.

This is what I tried:

# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")

df <- data.frame(age, sex)

# create the sample conditions
conditions <- list(
  list("age", c(18:100)),
  list("sex", c("f", "m"))
)

addIndicator <- function (df, columnName, validValues) {
  indicator <- vector()

  for (row in df[, toString(columnName)]) {
    # for some strange reason, %in% doesn't work correctly here, but always returns FALSE
    indicator <- append(indicator, row %in% validValues)
  }
  df <- cbind(df, indicator)

  # rename the column
  names(df)[length(names(df))] <- paste0("I_", columnName)

  return(df)
}

for (condition in conditions){
  columnName <- condition[1]
  validValues <- condition[2]
  df <- addIndicator(df, columnName, validValues)
}

print(df)

However, every condition is reported as not met, which is not what I expect:

  age sex I_age I_sex
1 120   x FALSE FALSE
2  45   f FALSE FALSE

I figured that %in% does not return the expected result. I checked typeof(row) and tried to boil the problem down to a minimal example. In that minimal example, with variables of the same type and values, %in% works properly. So something must be wrong in the context where I apply it. Since this is my first attempt at writing anything in R, I am stuck here.

What am I doing wrong and how can I achieve what I want?

Answer 1:

If you prefer an approach that uses the tidyverse family of packages:

library(tidyverse)

allowed_values <- list(age = 18:100, sex = c("f", "m"))

df %>%
  imap_dfr(~ .x %in% allowed_values[[.y]]) %>%
  rename_with(~ paste0('I_', .x)) %>%
  bind_cols(df)

imap_dfr allows you to manipulate each column in df using a lambda function. .x references the column content and .y references the name.

rename_with renames the columns using another lambda function and bind_cols combines the results with the original dataframe.
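
To make the roles of .x and .y concrete, here is a minimal sketch, assuming the purrr package is installed. It uses imap_chr (a sibling of imap_dfr that returns a character vector) on the question's sample data; the variable name described is only illustrative.

```r
library(purrr)

# Sample data from the question
df <- data.frame(age = c(120, 45), sex = c("x", "f"))

# For each column, .x is the column's content and .y is the column's name
described <- imap_chr(df, ~ paste0(.y, ": ", paste(.x, collapse = ", ")))

print(described)
```

Each element of the result pairs a column name with that column's values, which is the same pairing the indicator lambda relies on when it looks up allowed_values[[.y]].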

I borrowed the simplified list of conditions from ben's answer. I find my approach slightly more readable but that is a matter of taste and of whether you are already using the tidyverse elsewhere.

Answer 2:

conditions appears to be a nested list. When you use:

validValues <- condition[2]

in your for loop, the result is still a list (of length one), not the vector it contains.

To get the vector of values to use with %in%, extract the element with [[ instead:

validValues <- condition[[2]]
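
The difference can be checked directly. The following is a small sketch using one condition from the question; the variable names single and double are only illustrative.

```r
# One condition from the question: a column name and its valid values
condition <- list("age", 18:100)

single <- condition[2]    # `[` keeps the list wrapper: a list of length 1
double <- condition[[2]]  # `[[` extracts the element: an integer vector

print(is.list(single))    # TRUE
print(is.list(double))    # FALSE

# Matching against the list compares 45 with the list element as a whole,
# not with the individual numbers inside it
print(45 %in% single)     # FALSE
print(45 %in% double)     # TRUE
```

This is why every indicator in the question came out FALSE: %in% was given a one-element list rather than the vector of valid values.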

A simpler way to obtain the indicators is to store the conditions in a named list:

conditions_lst <- list(age = 18:100, sex = c("f", "m"))

and then use sapply instead of a for loop:

cbind(df, sapply(setNames(names(df), paste("I", names(df), sep = "_")), function(x) {
  df[[x]] %in% conditions_lst[[x]]
}))

Output

  age sex I_age I_sex
1 120   x FALSE FALSE
2  45   f  TRUE  TRUE
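
If you would rather keep the nested conditions list and the for loop from the question, a minimal sketch of the same fix could look like this. It is one possible rewrite, not the only one: it extracts both elements with [[ and relies on %in% being vectorized, so the inner row-by-row loop is not needed.

```r
# Sample data and nested conditions as in the question
df <- data.frame(age = c(120, 45), sex = c("x", "f"))
conditions <- list(
  list("age", 18:100),
  list("sex", c("f", "m"))
)

for (condition in conditions) {
  columnName <- condition[[1]]   # character string, not a list
  validValues <- condition[[2]]  # vector of valid values, not a list
  # %in% checks the whole column at once and returns one logical per row
  df[[paste0("I_", columnName)]] <- df[[columnName]] %in% validValues
}

print(df)
```

This should print the same indicator columns as the sapply version above.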

Source: https://stackoverflow.com/questions/62151711/r-create-indicator-columns-from-list-of-conditions

Tags

indicator