Question
I am spreading multiple categorical variables into Boolean columns using tidyr::spread(). Because the data contain NAs, spread() creates an extra column for them, named NA.
What I'm looking for is a way to get rid of the NA columns using
a) a piping solution (I've tried select_() and '['(), but don't know how to refer to the NA column's name or index), or
b) a custom function, which would be even better, or
c) a way to simply not generate the NA columns, Hadleyverse compatible if possible.
Below is my current (and very inelegantly repetitive) solution.
library(tidyr)
library(dplyr)
test <- data.frame(id = 1:4, name = c("anna", "bert", "charles", "daniel"),
                   flower = as.factor(c("rose", "rose", NA, "petunia")),
                   music = as.factor(c("pop", "classical", "rock", NA)),
                   degree = as.factor(c(NA, "PhD", "MSc", "MSc")))
test <- test %>%
  mutate(truval = TRUE) %>%
  spread(key = flower, value = truval, fill = FALSE)
test[ncol(test)] <- NULL
test <- test %>%
  mutate(truval = TRUE) %>%
  spread(key = music, value = truval, fill = FALSE)
test[ncol(test)] <- NULL
test <- test %>%
  mutate(truval = TRUE) %>%
  spread(key = degree, value = truval, fill = FALSE)
test[ncol(test)] <- NULL
test
Answer 1:
We can use select with backquotes to drop the NA column.
test %>%
  mutate(truval = TRUE) %>%
  spread(flower, truval, fill = FALSE) %>%
  select(-`NA`)
#   id    name     music degree petunia  rose
# 1  1    anna       pop   <NA>   FALSE  TRUE
# 2  2    bert classical    PhD   FALSE  TRUE
# 3  3 charles      rock    MSc   FALSE FALSE
# 4  4  daniel      <NA>    MSc    TRUE FALSE
I guess it is difficult not to generate the NA column, as the observations in the other columns are tied to it. We could use filter with is.na to remove the row that has NA in the 'flower' column, but then we would lose one row, i.e. the 3rd row.
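To make that trade-off concrete, here is a minimal sketch of the filter approach, reusing the `test` data frame from the question. The name `filtered` is only illustrative. Filtering first means no NA column is created, but the row with the missing flower disappears from the result.

```r
library(tidyr)
library(dplyr)

# Same data as in the question
test <- data.frame(id = 1:4, name = c("anna", "bert", "charles", "daniel"),
                   flower = as.factor(c("rose", "rose", NA, "petunia")),
                   music = as.factor(c("pop", "classical", "rock", NA)),
                   degree = as.factor(c(NA, "PhD", "MSc", "MSc")))

# Drop rows with a missing flower before spreading, so that
# spread() has no NA key to turn into a column
filtered <- test %>%
  filter(!is.na(flower)) %>%
  mutate(truval = TRUE) %>%
  spread(flower, truval, fill = FALSE)

nrow(test)                        # number of rows before filtering
nrow(filtered)                    # one row fewer: "charles" is gone
"charles" %in% filtered$name
```

This is usually not what you want when the other columns of the dropped row still carry information, which is why select(-`NA`) after spreading is the safer choice here.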
Answer 2:
As per @akrun's response, you can refer to the NA column with backquotes. Here is a function that takes care of it:
Spread_bool <- function(df, varname) {
  # spread a categorical variable to Boolean columns, remove NA column
  # Input:
  #   df: a data frame containing the variable to be spread
  #   varname: the "quoted" name of the variable to be spread
  #
  # Return:
  #   df: a data frame with the variable spread to columns
  df <- df %>%
    mutate(truval = TRUE) %>%
    spread_(varname, "truval", fill = FALSE) %>%
    select(-`NA`)
  df
}
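For option c) in the question (not generating the NA column at all), one possible alternative is to build the Boolean columns directly instead of spreading. The sketch below is not taken from either answer: `spread_bool_no_na` is a hypothetical helper written in base R, and it assumes that no category value collides with an existing column name.

```r
# Hypothetical helper, not part of tidyr or dplyr: adds one Boolean column
# per non-missing category and never creates a column for NA.
spread_bool_no_na <- function(df, varname) {
  x <- df[[varname]]
  # Observed, non-missing categories, in sorted order
  lv <- sort(unique(as.character(x[!is.na(x)])))
  for (l in lv) {
    # Rows with a missing value get FALSE in every new column
    df[[l]] <- !is.na(x) & x == l
  }
  df[[varname]] <- NULL   # drop the original categorical column
  df
}

# Same data as in the question
test <- data.frame(id = 1:4, name = c("anna", "bert", "charles", "daniel"),
                   flower = as.factor(c("rose", "rose", NA, "petunia")),
                   music = as.factor(c("pop", "classical", "rock", NA)),
                   degree = as.factor(c(NA, "PhD", "MSc", "MSc")))

# Apply the helper to each variable in turn
out <- Reduce(spread_bool_no_na, c("flower", "music", "degree"), test)
out
```

Because the helper takes the data frame as its first argument, it can also be chained with %>%, for example test %>% spread_bool_no_na("flower").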
Source: https://stackoverflow.com/questions/33191137/r-tidyr-spread-dealing-with-na-as-column-name