Question
I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.
This is what I tried:
# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")
df <- data.frame(age, sex)
# create the sample conditions
conditions <- list(
  list("age", c(18:100)),
  list("sex", c("f", "m"))
)
addIndicator <- function (df, columnName, validValues) {
  indicator <- vector()
  for (row in df[, toString(columnName)]) {
    # for some strange reason, %in% doesn't work correctly here, but always returns FALSE
    indicator <- append(indicator, row %in% validValues)
  }
  df <- cbind(df, indicator)
  # rename the column
  names(df)[length(names(df))] <- paste0("I_", columnName)
  return(df)
}
for (condition in conditions){
  columnName <- condition[1]
  validValues <- condition[2]
  df <- addIndicator(df, columnName, validValues)
}
print(df)
However, every condition is reported as not met, which is not what I expect:
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f FALSE FALSE
I figured that %in% does not return the expected result. I checked typeof(row) and tried to boil this down to a minimal example. In a simple minimal example, with the same types and values of the variables, %in% works properly. So something must be wrong in the context where I apply it. Since this is my first attempt to write anything in R, I am stuck here.
What am I doing wrong and how can I achieve what I want?
Answer 1:
If you prefer an approach that uses the tidyverse family of packages:
library(tidyverse)
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
  imap_dfr(~ .x %in% allowed_values[[.y]]) %>%
  rename_with(~ paste0('I_', .x)) %>%
  bind_cols(df)
imap_dfr allows you to manipulate each column in df using a lambda function. .x references the column content and .y references the column name. rename_with renames the columns using another lambda function, and bind_cols combines the results with the original dataframe.
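To see what the lambda receives, here is a minimal sketch that runs only the imap step, using plain imap (which returns a list) rather than imap_dfr. It assumes the purrr package is installed and rebuilds the sample df from the question.

```r
library(purrr)

# Sample data and allowed values, as in the question and the code above
df <- data.frame(age = c(120, 45), sex = c("x", "f"))
allowed_values <- list(age = 18:100, sex = c("f", "m"))

# For each column, .x is the column vector and .y is the column name,
# so allowed_values[[.y]] looks up the matching set of valid values.
checks <- imap(df, ~ .x %in% allowed_values[[.y]])

str(checks)
```

The result is a named list with one logical vector per column; imap_dfr additionally assembles those vectors into a dataframe.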
I borrowed the simplified list of conditions from ben's answer. I find my approach slightly more readable but that is a matter of taste and of whether you are already using the tidyverse elsewhere.
Answer 2:
conditions is a nested list. When you use:
validValues <- condition[2]
in your for loop, the result is also a list, because single brackets return a sub-list rather than the element itself.
To get the vector of values to use with %in%, extract the element with [[ instead:
validValues <- condition[[2]]
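The difference can be checked in isolation. This is a small sketch using one condition from the question; the variable names wrapped and values are only illustrative.

```r
condition <- list("age", 18:100)

# Single brackets keep the list wrapper: a list of length 1
wrapped <- condition[2]
is.list(wrapped)      # TRUE
45 %in% wrapped       # FALSE: 45 is compared against the list element as a whole

# Double brackets extract the element itself: an integer vector
values <- condition[[2]]
is.list(values)       # FALSE
45 %in% values        # TRUE
```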
A simplified approach to obtaining indicators could be with a simple list:
conditions_lst <- list(age = 18:100, sex = c("f", "m"))
And using sapply instead of a for loop:
cbind(df, sapply(setNames(names(df), paste("I", names(df), sep = "_")), function(x) {
  df[[x]] %in% conditions_lst[[x]]
}))
Output
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f TRUE TRUE
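If you would rather keep the structure of the original code, the [[ fix can be applied there directly. The following is a sketch, not the only way to write it: it keeps the question's nested conditions list and addIndicator name, and drops the row loop because %in% is already vectorised over the column.

```r
# Sample data and conditions from the question
df <- data.frame(age = c(120, 45), sex = c("x", "f"))
conditions <- list(
  list("age", 18:100),
  list("sex", c("f", "m"))
)

addIndicator <- function(df, columnName, validValues) {
  # %in% checks the whole column at once, so no per-row loop is needed
  df[[paste0("I_", columnName)]] <- df[[columnName]] %in% validValues
  df
}

for (condition in conditions) {
  # [[ extracts the column name and the vector of valid values
  df <- addIndicator(df, condition[[1]], condition[[2]])
}

print(df)
```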
Source: https://stackoverflow.com/questions/62151711/r-create-indicator-columns-from-list-of-conditions