Viewing all column names with any NA in R

一整个雨季 2020-12-17 20:15

I need to get the name of the columns that have at least 1 NA.

df <- data.frame(a=1:3, b=c(NA,8,6), c=c('t',NA,7))

I need to get "b, c".

5 answers
  • 2020-12-17 20:32

    Another acrobatic solution (just for fun):

    colnames(df)[!complete.cases(t(df))]
    [1] "b" "c"
    

    The idea: finding the columns of A that have at least one NA is equivalent to finding the rows of t(A) that have at least one NA. complete.cases returns TRUE for the rows without any missing value (the check itself is fast, since it is essentially a call to a C function), so negating it marks the rows of t(A), that is, the columns of A, that contain an NA.
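
    As a quick check of that equivalence, the sketch below compares the transposed complete.cases result with a plain column-wise test on the df from the question. The variable names m, via_transpose and via_columns are illustrative only. Note that t(df) first coerces the data frame to a matrix, here a character matrix because column c holds strings.

    ```r
    df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

    # t() coerces the data frame to a matrix: columns of df become rows of m
    m <- t(df)

    # complete.cases() flags rows without NA, so negate it to flag columns of df with NA
    via_transpose <- !complete.cases(m)

    # Column-wise reference: does each column contain any NA?
    via_columns <- unname(sapply(df, anyNA))

    # Both approaches should agree element by element
    all(via_transpose == via_columns)

    colnames(df)[via_transpose]
    # [1] "b" "c"
    ```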

  • 2020-12-17 20:36
     names(df)[!!colSums(is.na(df))]
     #[1] "b" "c"
    

    Explanation

    colSums(is.na(df)) # number of missing values in each column
    #a b c 
    #0 1 1 
    

    By using !, we are creating a logical index

    !colSums(is.na(df))   #here the value of `0` will be `TRUE` and all other values `>0` FALSE
     #   a     b     c 
     #TRUE FALSE FALSE 
    

    But we need the columns that have at least one NA, so we negate again with !:

    !!colSums(is.na(df))
    #   a     b     c 
    #FALSE  TRUE  TRUE 
    

    Finally, use this logical index to subset names(df) and get the names of the columns that have at least one NA.
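
    The double negation is compact but easy to misread. As a sketch of an equivalent, more explicit spelling (same df as in the question; the variable names are illustrative), the counts can be compared to zero directly. Both forms should give the same logical index.

    ```r
    df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

    na_counts <- colSums(is.na(df))   # number of NAs in each column

    idx_double_bang <- !!na_counts    # counts coerced to logical via double negation
    idx_explicit    <- na_counts > 0  # the same index, stated directly

    identical(idx_double_bang, idx_explicit)

    names(df)[idx_explicit]
    # [1] "b" "c"
    ```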

    Benchmarks

     set.seed(49)
     df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
    
     library(microbenchmark)
    
     f1 <- function() {contains_any_na = sapply(df1, function(x) any(is.na(x)))
                names(df1)[contains_any_na]}
    
     f2 <- function() {colnames(df1)[!complete.cases(t(df1))] }
     f3 <- function() { names(df1)[!!colSums(is.na(df1))] }
    
     microbenchmark(f1(), f2(), f3(), unit="relative")
     #Unit: relative
     #expr      min       lq   median       uq      max neval
     #f1() 1.000000 1.000000 1.000000 1.000000 1.000000   100
     #f2() 8.921109 7.289053 6.852122 6.210826 4.889684   100
     #f3() 3.248072 3.105798 2.984453 2.774513 2.599745   100
    

    EDIT (performance explanation):

    Perhaps surprisingly, the sapply-based solution is the fastest here. As @flodel noted in a comment, the other two solutions build a matrix behind the scenes: t(df) and is.na(df) each create a full matrix.
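
    To see the intermediate objects involved, the following sketch (on the small df from the question) checks what each approach builds before it looks for NA values. It only illustrates the extra matrix that gets allocated; it is not a benchmark.

    ```r
    df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

    # is.na() on a data frame returns a full logical matrix (rows x columns)
    na_matrix <- is.na(df)
    is.matrix(na_matrix)
    dim(na_matrix)

    # t() also converts the whole data frame to a matrix before transposing
    transposed <- t(df)
    is.matrix(transposed)
    dim(transposed)

    # sapply() visits the columns one by one and keeps a single logical per column
    per_column <- sapply(df, function(x) any(is.na(x)))
    length(per_column)
    ```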

  • 2020-12-17 20:38

    Try the data.table version:

    library(data.table)
    setDT(df)
    names(df)[df[,sapply(.SD, function(x) any(is.na(x))),]]
    [1] "b" "c"
    

    Microbenchmarking using @akrun's code:

    library(microbenchmark)
    set.seed(49)
    df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
    setDT(df1)
    
    
    f1 <- function() {contains_any_na = sapply(df1, function(x) any(is.na(x)))
               names(df1)[contains_any_na]}
    
    f2 <- function() {colnames(df1)[!complete.cases(t(df1))] }
    f3 <- function() { names(df1)[!!colSums(is.na(df1))] }
    
    f4 <- function() { names(df1)[df1[,sapply(.SD, function(x) any(is.na(x))),]] }
    
    microbenchmark(f1(), f2(), f3(), f4(), unit="relative")   
    # Unit: relative
    #  expr       min        lq    median       uq      max neval
    #  f1()  1.000000  1.000000  1.000000 1.000000 1.000000   100
    #  f2() 10.459124 10.928821 10.955986 9.858967 7.069066   100
    #  f3()  3.323144  3.805183  4.159624 3.775549 2.797329   100
    #  f4() 10.108998 10.242207 10.121022 9.117067 6.576976   100
    

    @agstudy: this solution is similar in speed to colnames(df1)[!complete.cases(t(df1))].

  • 2020-12-17 20:38

    A simple one-liner for this is:

    colnames(df[, sapply(df, function(x) any(is.na(x))), drop = FALSE])
    

    Explanation:

    sapply(df, function(x) any(is.na(x)))
    

    returns TRUE/FALSE for each column, with TRUE where the column has at least one NA. df[, sapply(df, function(x) any(is.na(x))), drop = FALSE] selects the subset of the data frame made up of those columns, and colnames gives their names. drop = FALSE keeps the result a data frame even when only one column matches.
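
    The single-column case is worth a quick check: by default, df[, idx] simplifies a one-column result to a plain vector, for which colnames() returns NULL. A minimal sketch, assuming a hypothetical data frame df_one in which only one column has an NA:

    ```r
    # Hypothetical data: only column b contains an NA
    df_one <- data.frame(a = 1:3, b = c(NA, 8, 6))

    idx <- sapply(df_one, function(x) any(is.na(x)))

    # Default subsetting drops the single column to a vector, so the name is lost
    dropped <- colnames(df_one[, idx])

    # drop = FALSE keeps a one-column data frame and therefore its name
    kept <- colnames(df_one[, idx, drop = FALSE])

    is.null(dropped)
    kept
    ```

    Indexing names(df_one)[idx] directly avoids the issue altogether, since no data frame subset is created.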

  • 2020-12-17 20:51

    You were very close. Your first try yields a boolean vector, which you can use to index the names of df:

    contains_any_na = sapply(df, function(x) any(is.na(x)))
    names(df)[contains_any_na]
    # [1] "b" "c"
    

    Update Jan 14, 2017: As of R version 3.1.0, anyNA() can be used as an alternative to any(is.na(.)), and the above code can be simplified to

    names(df)[sapply(df, anyNA)]
    # [1] "b" "c"
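
    For reuse, the same idea can be wrapped in a small helper. The function name na_columns is hypothetical, and vapply() is swapped in for sapply() here so that the result is always a logical vector, including for a data frame with no columns (where sapply() would return an empty list). Like the code above, it assumes R 3.1.0 or later for anyNA().

    ```r
    # Hypothetical helper: names of the columns that contain at least one NA
    na_columns <- function(data) {
      has_na <- vapply(data, anyNA, logical(1))  # one TRUE/FALSE per column
      names(data)[has_na]
    }

    df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))
    na_columns(df)
    # [1] "b" "c"

    # Zero columns: the result is an empty character vector rather than an error
    na_columns(data.frame())
    ```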
    