Question

Data

I'm working with a data set resembling the data.frame generated below:

set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA

Several columns contain NA values; in the last column, 70% of the observations (14 of 20) are NA.

> sapply(dta, function(x) {table(is.na(x))})
$observation

FALSE 
   20 

$valueA

FALSE 
   20 

$valueB

FALSE  TRUE 
   16     4 

$valueC

FALSE  TRUE 
   11     9 

$valueD

FALSE  TRUE 
    6    14
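
For a more compact view of the same information, the share of NAs per column can be computed directly. This is a minimal sketch that rebuilds the data from above; `na_share` is a name introduced here for illustration only.

```r
# Rebuild the example data so the snippet runs on its own
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# is.na() gives a logical matrix; colMeans() turns it into
# the proportion of NA values in each column
na_share <- colMeans(is.na(dta))
na_share
# valueB is 0.20, valueC is 0.45, valueD is 0.70;
# observation and valueA contain no NAs
```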

Problem

I would like to remove this column within a dplyr pipeline, ideally by passing a condition to select.

Attempts

This is easy to do in base R. For example, to select columns with less than 50% NAs I can do:

dta[, colSums(is.na(dta)) < nrow(dta) / 2]

which produces:

> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA

Task

I'm interested in achieving the same flexibility in a dplyr pipeline:

library(dplyr)     # Data manipulation
library(magrittr)  # Assignment pipe (%<>%)

dta %<>%
  # Some transformations I'm doing on the data
  mutate_each(funs(as.numeric)) %>% 
  # I want my select to take place here

Answer 1:

Like this perhaps?

dta %>% select(which(colMeans(is.na(.)) < 0.5)) %>% head
#  observation    valueA    valueB    valueC
#1           1 0.2655087 0.9347052 0.8209463
#2           2 0.3721239        NA        NA
#3           3 0.5728534        NA        NA
#4           4 0.9082078        NA        NA
#5           5 0.2016819        NA        NA
#6           6 0.8983897 0.3861141        NA

Updated to use colMeans instead of colSums, so there is no need to divide by the number of rows.

And, just for the record, in base R you could also use colMeans:

dta[,colMeans(is.na(dta)) < 0.5]

Answer 2:

I think this does the job:

dta %>% select_if(~mean(is.na(.)) < 0.5) %>% head()

  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
3           3 0.5728534        NA        NA
4           4 0.9082078        NA        NA
5           5 0.2016819        NA        NA
6           6 0.8983897 0.3861141        NA
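
In newer dplyr releases the scoped verbs such as select_if are superseded, and the same predicate can be passed to select through where(). This is a hedged sketch, assuming dplyr 1.0.0 or later is installed; `kept` is a name chosen here for illustration.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0, which supports where() in select()

# Rebuild the example data so the snippet runs on its own
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3] <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# Keep only the columns in which fewer than half of the values are NA
kept <- dta %>%
  select(where(~ mean(is.na(.x)) < 0.5))

names(kept)
# "observation" "valueA" "valueB" "valueC"
```

The predicate is applied to one column at a time, so any function returning a single TRUE or FALSE per column can be used in its place.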

Answer 3:

We can use extract from magrittr after getting a logical vector with summarise_each/unlist:

library(magrittr)
library(dplyr)
dta %>% 
    summarise_each(funs(sum(is.na(.)) < n()/2)) %>% 
    unlist() %>%
    extract(dta,.)

Or use Filter from base R:

Filter(function(x) sum(is.na(x)) < length(x)/2, dta)

Or, slightly more compactly:

Filter(function(x) mean(is.na(x)) < 0.5, dta)
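
If the threshold needs to vary, the same idea can be wrapped in a small function that also slots into a pipe. This is only a sketch: `drop_sparse_cols` and `max_na_share` are hypothetical names, not part of dplyr or base R.

```r
# Hypothetical helper: keep columns whose share of NAs is below max_na_share
drop_sparse_cols <- function(df, max_na_share = 0.5) {
  stopifnot(is.data.frame(df))
  # One TRUE/FALSE per column: is the NA proportion under the threshold?
  keep <- vapply(df, function(x) mean(is.na(x)) < max_na_share, logical(1))
  df[, keep, drop = FALSE]
}

# Small illustrative data frame (made up for this example)
df <- data.frame(id = 1:4,
                 a  = c(1, NA, 3, 4),    # 25% NA
                 b  = c(NA, NA, NA, 4))  # 75% NA

names(drop_sparse_cols(df))       # "id" "a"
names(drop_sparse_cols(df, 0.2))  # "id"
```

Because the data frame is the first argument, the helper can be called as `dta %>% drop_sparse_cols(0.5)` at the point in the pipeline where the select step would go.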

Source: https://stackoverflow.com/questions/34852112/conditionally-selecting-columns-in-dplyr-where-certain-proportion-of-values-is-n
