I need to get the names of the columns that have at least one NA.
df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))
I need to get "b, c".
Another acrobatic solution (just for fun):
colnames(df)[!complete.cases(t(df))]
[1] "b" "c"
The idea: getting the columns of df that have at least one NA is equivalent to getting the rows of t(df) that have at least one NA.
complete.cases, by definition, gives the rows without any missing value (and it is very efficient, since it is essentially a call to a C function).
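To make the transposition idea concrete, here is a minimal sketch of the intermediate steps on the example data (output shown is indicative):
# t(df) turns the data frame into a (character) matrix whose rows are the
# original columns a, b and c
complete.cases(t(df))   # one value per row of t(df), i.e. per original column
# [1]  TRUE FALSE FALSE
!complete.cases(t(df))  # TRUE now marks the columns containing at least one NA
# [1] FALSE  TRUE  TRUE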
names(df)[!!colSums(is.na(df))]
#[1] "b" "c"
colSums(is.na(df)) # gives the number of missing values in each column
#a b c
#0 1 1
By using !, we create a logical index:
!colSums(is.na(df)) # a count of 0 becomes TRUE and every count > 0 becomes FALSE
# a b c
#TRUE FALSE FALSE
But we need to select the columns that have at least one NA, so we negate again with !:
!!colSums(is.na(df))
# a b c
#FALSE TRUE TRUE
and use this logical index to get the column names that have at least one NA.
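An equivalent way to build the same logical index, sketched here, is to compare the per-column NA counts with zero explicitly:
colSums(is.na(df)) > 0   # TRUE for columns holding at least one NA
#    a     b     c
#FALSE  TRUE  TRUE
names(df)[colSums(is.na(df)) > 0]
#[1] "b" "c"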
set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
library(microbenchmark)
f1 <- function() {contains_any_na = sapply(df1, function(x) any(is.na(x)))
names(df1)[contains_any_na]}
f2 <- function() {colnames(df1)[!complete.cases(t(df1))] }
f3 <- function() { names(df1)[!!colSums(is.na(df1))] }
microbenchmark(f1(), f2(), f3(), unit="relative")
#Unit: relative
# expr      min       lq   median       uq      max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000   100
# f2() 8.921109 7.289053 6.852122 6.210826 4.889684   100
# f3() 3.248072 3.105798 2.984453 2.774513 2.599745   100
Perhaps surprisingly, the sapply-based solution is the winner here because, as noted in @flodel's comment below, the other two solutions create a matrix behind the scenes (both t(df) and is.na(df) build a full matrix).
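A rough way to see those hidden allocations is to measure the intermediate objects directly (a sketch; the exact sizes will vary with your data):
object.size(is.na(df1))  # full logical matrix, one cell per element of df1
object.size(t(df1))      # full matrix created by the transpose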
Try the data.table version:
library(data.table)
setDT(df)
names(df)[df[, sapply(.SD, function(x) any(is.na(x)))]]
[1] "b" "c"
Microbenchmarking using @akrun's code:
set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
setDT(df1)
f1 <- function() {contains_any_na = sapply(df1, function(x) any(is.na(x)))
names(df1)[contains_any_na]}
f2 <- function() {colnames(df1)[!complete.cases(t(df1))] }
f3 <- function() { names(df1)[!!colSums(is.na(df1))] }
f4 <- function() { names(df1)[df1[, sapply(.SD, function(x) any(is.na(x)))]] }
microbenchmark(f1(), f2(), f3(), f4(), unit="relative")
# Unit: relative
#  expr       min        lq    median        uq      max neval
#  f1()  1.000000  1.000000  1.000000  1.000000 1.000000   100
#  f2() 10.459124 10.928821 10.955986  9.858967 7.069066   100
#  f3()  3.323144  3.805183  4.159624  3.775549 2.797329   100
#  f4() 10.108998 10.242207 10.121022  9.117067 6.576976   100
@agstudy: This solution is similar in speed to colnames(df1)[!complete.cases(t(df1))].
A simple one-liner for this is:
colnames(df[,sapply(df, function(x) any(is.na(x)))])
Explanation:
sapply(df, function(x) any(is.na(x))) returns TRUE or FALSE for each column, indicating whether it contains at least one NA. df[, sapply(df, function(x) any(is.na(x)))] subsets the data frame to just those columns with at least one NA, and colnames gives the names of those columns.
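A slightly leaner variant of the same idea, sketched below, indexes the names directly instead of materializing the subset data frame first:
colnames(df)[sapply(df, function(x) any(is.na(x)))]
# [1] "b" "c"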
You were very close. Your first try yields a logical vector, which you can use to index the names of df:
contains_any_na = sapply(df, function(x) any(is.na(x)))
names(df)[contains_any_na]
# [1] "b" "c"
Update Jan 14, 2017: As of R version 3.1.0, anyNA() can be used as an alternative to any(is.na(.)), and the above code can be simplified to
names(df)[sapply(df, anyNA)]
# [1] "b" "c"