R - tidyr - spread() - dealing with NA as column name

Submitted by 孤街醉人 on 2019-12-24 16:03:04

Question


I am spreading multiple categorical variables to Boolean columns using tidyr::spread(). As the data contains NAs, spread creates a new column without a name.

What I'm looking for is a way to get rid of the NA columns using

a) a piping solution (I've tried select_() and '['(), but don't know how to refer to the NA column's name or index) or

b) a custom function, which would be even better

c) a way to simply not generate the NA columns, Hadleyverse compatible, if possible.

Below is my current (and very inelegantly repetitive) solution.

library(tidyr)
library(dplyr)

test <- data.frame(id = 1:4, name = c("anna", "bert", "charles", "daniel"),
                   flower = as.factor(c("rose", "rose", NA, "petunia")),
                   music = as.factor(c("pop","classical", "rock", NA)),
                   degree = as.factor(c(NA, "PhD", "MSc", "MSc")))

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = flower, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = music, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test <- test %>% 
  mutate(truval = TRUE) %>% 
  spread(key = degree, value = truval, fill = FALSE)
test[ncol(test)] <- NULL

test

Answer 1:


We can use select() with backquotes to refer to the "NA" column.

 test %>% 
    mutate(truval= TRUE) %>% 
    spread(flower, truval, fill=FALSE) %>% 
    select(-`NA`)
 #  id    name     music degree petunia  rose
 #1  1    anna       pop   <NA>   FALSE  TRUE
 #2  2    bert classical    PhD   FALSE  TRUE
 #3  3 charles      rock    MSc   FALSE FALSE
 #4  4  daniel      <NA>    MSc    TRUE FALSE

I guess it is difficult to avoid generating the NA column, because the observations in the other columns are tied to it. We could use filter() with is.na() to remove the rows that have NA in the 'flower' column before spreading, but then we would lose a row (here, the 3rd row).
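
Dropping the column by name can be made tolerant of its exact spelling. The sketch below assumes dplyr 1.0 or later (for any_of()) and uses a trimmed-down copy of the question's data. The name "<NA>" is included because, as far as I know, some tidyr versions label the missing-key column that way rather than "NA"; treat that as an assumption, not something stated in the answer.

```r
library(dplyr)
library(tidyr)

# Trimmed-down copy of the question's data (only one variable to spread).
flowers <- data.frame(id = 1:4,
                      flower = c("rose", "rose", NA, "petunia"),
                      stringsAsFactors = FALSE)

wide <- flowers %>%
  mutate(truval = TRUE) %>%
  spread(key = flower, value = truval, fill = FALSE) %>%
  # any_of() ignores names that are absent, so this step does not fail
  # whichever of the two spellings the installed tidyr uses.
  select(-any_of(c("NA", "<NA>")))

wide
```

All four rows are kept; the row whose flower is missing simply has FALSE in every flower column.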




Answer 2:


As per @akrun's response, you can refer to the NA column with backquotes. Here is a function that takes care of it:

Spread_bool <- function(df, varname) {
# spread a categorical variable to Boolean columns, remove NA column
# Input:
#  df: a data frame containing the variable to be spread
#  varname: the "quoted" name of the variable to be spread
#
# Return:
#  df: a data frame with the variable spread to columns

  df <- df %>% 
    mutate(truval = TRUE) %>% 
    spread_(varname, "truval", fill = FALSE) %>% 
    select(-`NA`)

  df

}
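
To cover all three variables without repeating the pipeline, the same idea can be wrapped in a loop. This is only a sketch: spread_bool_all is a hypothetical name, and it swaps spread_() for pivot_wider(), assuming tidyr 1.1 or later (for a scalar values_fill) and dplyr 1.0 or later (for any_of() and all_of()). It also assumes that no two variables share a level name, since that would make the new column names clash.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: spread several categorical variables to Boolean
# columns, dropping the column created for missing values each time.
spread_bool_all <- function(df, varnames) {
  for (v in varnames) {
    df <- df %>%
      mutate(truval = TRUE) %>%
      pivot_wider(names_from = all_of(v),
                  values_from = truval,
                  values_fill = FALSE) %>%
      # Tolerate either spelling of the missing-value column, and its absence.
      select(-any_of(c("NA", "<NA>")))
  }
  df
}

test <- data.frame(id = 1:4,
                   name = c("anna", "bert", "charles", "daniel"),
                   flower = c("rose", "rose", NA, "petunia"),
                   music = c("pop", "classical", "rock", NA),
                   degree = c(NA, "PhD", "MSc", "MSc"),
                   stringsAsFactors = FALSE)

result <- spread_bool_all(test, c("flower", "music", "degree"))
result
```

Unlike select(-`NA`), any_of() does not raise an error when a variable happens to contain no missing values, so the helper can be applied to any mix of columns.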


Source: https://stackoverflow.com/questions/33191137/r-tidyr-spread-dealing-with-na-as-column-name
