R - tidyr - spread() - dealing with NA as column name

Question

I am spreading multiple categorical variables to Boolean columns using tidyr::spread(). As the data contains NAs, spread creates a new column without a name.

What I'm looking for is a way to get rid of the NA columns using

a) a piping solution (I've tried select_() and '['(), but don't know how to refer to the NA column's name or index) or

b) a custom function, which would be even better

c) a way to simply not generate the NA columns, Hadleyverse compatible, if possible.

Below is my current (and very inelegantly repetitive) solution.

library(tidyr)
library(dplyr)

test <- data.frame(id = 1:4, name = c("anna", "bert", "charles", "daniel"),
                   flower = as.factor(c("rose", "rose", NA, "petunia")),
                   music = as.factor(c("pop","classical", "rock", NA)),
                   degree = as.factor(c(NA, "PhD", "MSc", "MSc")))

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = flower, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = music, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = degree, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test

Answer 1:

We can use select() with backticks to refer to the "NA" column.

 test %>% 
    mutate(truval= TRUE) %>% 
    spread(flower, truval, fill=FALSE) %>% 
    select(-`NA`)
 #  id    name     music degree petunia  rose
 #1  1    anna       pop   <NA>   FALSE  TRUE
 #2  2    bert classical    PhD   FALSE  TRUE
 #3  3 charles      rock    MSc   FALSE FALSE
 #4  4  daniel      <NA>    MSc    TRUE FALSE

I guess it is difficult not to generate the NA column, as the observations in the other columns are tied to it. We could use filter() with is.na() to remove the row that has NA in the 'flower' column before spreading, but then we would lose a row, i.e. the 3rd row.
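One way to avoid creating the NA column in the first place, sketched here in base R rather than with spread(), is to build one logical column per observed level of the variable. The helper name add_bool_cols is hypothetical (it is not part of tidyr or dplyr), and the sketch assumes that none of the level names collides with an existing column name, since a colliding column would be overwritten.

```r
# Hypothetical helper (not part of tidyr/dplyr): add one logical column per
# observed, non-missing level of `varname`, then drop the original column.
# NA is never treated as a level, so no NA column is created and no row is lost.
add_bool_cols <- function(df, varname) {
  x <- as.character(df[[varname]])
  lvls <- sort(unique(x[!is.na(x)]))
  for (lvl in lvls) {
    # Rows with a missing value become FALSE for every level
    df[[lvl]] <- !is.na(x) & x == lvl
  }
  df[[varname]] <- NULL
  df
}

test <- data.frame(id = 1:4, name = c("anna", "bert", "charles", "daniel"),
                   flower = as.factor(c("rose", "rose", NA, "petunia")),
                   music = as.factor(c("pop", "classical", "rock", NA)),
                   degree = as.factor(c(NA, "PhD", "MSc", "MSc")))

# Single variable
res <- add_bool_cols(test, "flower")

# All three variables in one go, without repeating the code block
all_res <- Reduce(add_bool_cols, c("flower", "music", "degree"), test)
```

Because the data frame is the first argument, the helper should also fit into a pipe, e.g. test %>% add_bool_cols("flower").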

Answer 2:

As per @akrun's response, you can refer to the NA column with backticks. Here is a function that takes care of it:

Spread_bool <- function(df, varname) {
# spread a categorical variable to Boolean columns, remove NA column
# Input:
#  df: a data frame containing the variable to be spread
#  varname: the "quoted" name of the variable to be spread
#
# Return:
#  df: a data frame with the variable spread to columns

  df <- df %>% 
    mutate(truval = TRUE) %>% 
    spread_(varname, "truval", fill = FALSE) %>% 
    select(-`NA`)

  df

}
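A function like the one above would likely fail when the variable has no missing values, because there is then no NA column to select. As a hedged sketch, the drop step could be made tolerant with a small base R helper. The name drop_na_named_cols is hypothetical, and the set of labels it checks ("NA" and "<NA>") is an assumption about how different tidyr versions may name the column.

```r
# Hypothetical helper: remove any column whose name is missing, "NA" or "<NA>".
# If no such column exists, the data frame is returned unchanged.
drop_na_named_cols <- function(df) {
  nms <- names(df)
  keep <- !(is.na(nms) | nms %in% c("NA", "<NA>"))
  df[, keep, drop = FALSE]
}

# Small stand-in for a spread() result that contains an "NA" column
demo <- data.frame(id = 1:2, rose = c(TRUE, FALSE), na_col = c(FALSE, TRUE))
names(demo)[3] <- "NA"

cleaned <- drop_na_named_cols(demo)
names(cleaned)
# "id" "rose"
```

Since the helper takes the data frame as its first argument, it could replace the select() step at the end of the pipe.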

Source: https://stackoverflow.com/questions/33191137/r-tidyr-spread-dealing-with-na-as-column-name

Tags

piping

tidyr

spread