Remove rows with all or some NAs (missing values) in data.frame


I'd like to remove the lines in this data frame that:

a) contain NAs across all columns. Below is my example data frame.



        
16 Answers
  • 2020-11-21 06:10

    Assuming dat is your data frame, the expected output can be achieved using:

    1. rowSums

    > dat[!rowSums((is.na(dat))),]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0   2    2    2    2
    6 ENSG00000221312    0   1    2    3    2
    

    2. lapply

    > dat[!Reduce('|',lapply(dat,is.na)),]
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0   2    2    2    2
    6 ENSG00000221312    0   1    2    3    2
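
Both one-liners rely on the logical matrix returned by is.na(). As a minimal sketch of what happens step by step, the example below uses a small hypothetical data frame (the values and the gene labels are made up for illustration; only the column names are borrowed from the output above). It also shows one possible way to keep rows where at least one measurement column is present.

```r
# Hypothetical toy data, not the data frame from the question
dat <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  hsap = c(0, NA, 1, NA),
  mmul = c(2, NA, NA, NA),
  stringsAsFactors = FALSE
)

na_mask    <- is.na(dat)         # logical matrix, TRUE where a cell is NA
na_per_row <- rowSums(na_mask)   # number of NA cells in each row

# Keep rows with no NA at all (equivalent to dat[!rowSums(is.na(dat)), ])
no_na <- dat[na_per_row == 0, ]

# Keep rows where at least one of the measurement columns is present
meas_cols <- c("hsap", "mmul")
some_data <- dat[rowSums(!is.na(dat[meas_cols])) > 0, ]
```

Comparing against 0 explicitly does the same thing as negating the row sums, but makes the intent a little easier to read.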
    
  • 2020-11-21 06:12

    For your first question, here is some code I am comfortable with for getting rid of every row that contains an NA. Thanks to @Gregor for making it simpler.

    final[!(rowSums(is.na(final))),]
    

    For the second question, the code is just an alteration of the previous solution.

    final[as.logical((rowSums(is.na(final))-5)),]
    

    Note that the 5 is the number of columns in your data. This eliminates the rows that consist entirely of NAs: their rowSums value is 5, which becomes zero after the subtraction, and as.logical turns zero into FALSE. This time, as.logical is necessary.
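
The hard-coded 5 can be avoided by comparing the NA count with ncol(). The following is a sketch on a hypothetical five-column data frame (the name final is reused from above, but the values are made up), not a replacement for the answer's own code.

```r
# Hypothetical example data: five columns, second row entirely NA
final <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 2), c = c(1, NA, NA),
                    d = c(4, NA, 6), e = c(7, NA, 9))

n_na <- rowSums(is.na(final))   # number of NA cells per row

# Drop rows in which every column is NA, without hard-coding the column count
not_all_na <- final[n_na < ncol(final), ]
```

On this data the result matches final[as.logical(rowSums(is.na(final)) - 5), ], and it keeps working if columns are added or removed.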

  • 2020-11-21 06:13

    If performance is a priority, use data.table and na.omit() with optional param cols=.

    na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).

    If you don't want to use data.table, use complete.cases().

    On a vanilla data.frame, complete.cases is faster than na.omit() or dplyr::drop_na(). Notice that na.omit.data.frame does not support cols=.
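
For readers who have not used complete.cases() before, here is a minimal base R sketch on a hypothetical data frame (names and values are illustrative only). It covers both cases: dropping rows with an NA anywhere, and dropping rows with an NA in selected columns.

```r
# Hypothetical data frame for illustration
df <- data.frame(
  id = c("a", "b", "c", "d"),
  x  = c(1, NA, 3, 4),
  y  = c(NA, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Drop rows with an NA in any column
all_complete <- df[complete.cases(df), ]

# Drop rows with an NA in selected columns only (here just "x");
# drop = FALSE keeps the selection a data frame
x_complete <- df[complete.cases(df[, "x", drop = FALSE]), ]
```

Restricting the check to a column subset is done by subsetting the argument to complete.cases(), while the outer df[...] still returns all columns.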

    Benchmark result

    Here is a comparison of base (blue), dplyr (pink), and data.table (yellow) methods for dropping observations with missing values in either all columns or selected columns, on a notional dataset of 1 million observations of 20 numeric variables, each with an independent 5% likelihood of being missing, and a subset of 4 variables for part 2.

    Your results may vary based on length, width, and sparsity of your particular dataset.

    Note log scale on y axis.

    Benchmark script

    #-------  Adjust these assumptions for your own use case  ------------
    row_size   <- 1e6L 
    col_size   <- 20    # not including ID column
    p_missing  <- 0.05   # likelihood of missing observation (except ID col)
    col_subset <- 18:21  # second part of question: filter on select columns
    
    #-------  System info for benchmark  ----------------------------------
    R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
    library(data.table); packageVersion('data.table') # 1.10.4.3
    library(dplyr);      packageVersion('dplyr')      # 0.7.4
    library(tidyr);      packageVersion('tidyr')      # 0.8.0
    library(microbenchmark)
    
    #-------  Example dataset using above assumptions  --------------------
    fakeData <- function(m, n, p){
      set.seed(123)
      m <-  matrix(runif(m*n), nrow=m, ncol=n)
      m[m<p] <- NA
      return(m)
    }
    df <- cbind( data.frame(id = paste0('ID',seq(row_size)), 
                            stringsAsFactors = FALSE),
                 data.frame(fakeData(row_size, col_size, p_missing) )
                 )
    dt <- data.table(df)
    
    par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
    boxplot(
      microbenchmark(
        df[complete.cases(df), ],
        na.omit(df),
        df %>% drop_na,
        dt[complete.cases(dt), ],
        na.omit(dt)
      ), xlab='', 
      main = 'Performance: Drop any NA observation',
      col=c(rep('lightblue',2),'salmon',rep('beige',2))
    )
    boxplot(
      microbenchmark(
        df[complete.cases(df[,col_subset]), ],
        #na.omit(df), # col subset not supported in na.omit.data.frame
        df %>% drop_na(col_subset),
        dt[complete.cases(dt[,col_subset,with=FALSE]), ],
        na.omit(dt, cols=col_subset) # see ?na.omit.data.table
      ), xlab='', 
      main = 'Performance: Drop NA obs. in select cols',
      col=c('lightblue','salmon',rep('beige',2))
    )
    
  • 2020-11-21 06:14

    One approach that is both general and yields fairly readable code is to use the filter() function together with the across() helper from the {dplyr} package.

    library(dplyr)
    
    vars_to_check <- c("rnor", "cfam")
    
    # Filter a specific list of columns to keep only non-missing entries
    
    df %>% 
      filter(across(one_of(vars_to_check),
                    ~ !is.na(.x)))
    
    # Filter all the columns to exclude NA
    df %>% 
      filter(across(everything(),
                    ~ !is.na(.)))
    
    # Filter only numeric columns
    df %>%
      filter(across(where(is.numeric),
                    ~ !is.na(.)))
    

    Similarly, the scoped variants of filter() in the dplyr package (filter_all, filter_at, filter_if) accomplish the same thing:

    library(dplyr)
    
    vars_to_check <- c("rnor", "cfam")
    
    # Filter a specific list of columns to keep only non-missing entries
    df %>% 
      filter_at(.vars = vars(one_of(vars_to_check)),
                ~ !is.na(.))
    
    # Filter all the columns to exclude NA
    df %>% 
      filter_all(~ !is.na(.))
    
    # Filter only numeric columns
    df %>%
      filter_if(is.numeric,
                ~ !is.na(.))
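
More recent dplyr releases (1.0.4 and later, as far as I know) provide if_all() and if_any() for use inside filter(), and all_of() for selecting columns from a character vector. The sketch below assumes such a version is installed and uses a hypothetical data frame; only the column names rnor and cfam come from the examples above.

```r
library(dplyr)  # assumes dplyr >= 1.0.4, which provides if_all() and if_any()

# Hypothetical data for illustration
df <- data.frame(
  gene = c("g1", "g2", "g3"),
  rnor = c(1, NA, NA),
  cfam = c(2, 3, NA),
  stringsAsFactors = FALSE
)
vars_to_check <- c("rnor", "cfam")

# Keep rows where all of the listed columns are non-missing
all_present <- df %>%
  filter(if_all(all_of(vars_to_check), ~ !is.na(.x)))

# Keep rows where at least one of the listed columns is non-missing
any_present <- df %>%
  filter(if_any(all_of(vars_to_check), ~ !is.na(.x)))
```

if_all() corresponds to "drop a row if any of these columns is NA", while if_any() corresponds to "drop a row only if all of these columns are NA".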
    
  • 2020-11-21 06:16
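    # Remove every row that contains one of the values listed in `dart`
    delete.dirt <- function(DF, dart = c('NA')) {
      # TRUE for rows in which no cell matches a value in `dart`
      clean_rows <- apply(DF, 1, function(r) !any(r %in% dart))
      DF[clean_rows, , drop = FALSE]
    }
    
    mydata <- delete.dirt(mydata)
    

    The function above deletes every row of the data frame that has 'NA' in any column and returns the resulting data. Note that the default matches the literal string 'NA'; to also catch real missing values, include NA itself, for example dart=c(NA, 'NA'). If you want to check for several values, such as NA and ?, change dart=c('NA') in the function parameters to dart=c('NA', '?').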
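
An alternative that avoids the row-wise apply() is to convert the placeholder strings to real NA values first and then use complete.cases(). This is only a sketch on hypothetical data (the column names and the sentinels vector are made up for illustration), and note that it also drops rows that already contain genuine NA values.

```r
# Hypothetical data with "?" and "NA" strings standing in for missing values
mydata <- data.frame(
  name  = c("a", "?", "c", "d"),
  score = c("1", "2", "NA", "4"),
  stringsAsFactors = FALSE
)
sentinels <- c("NA", "?")

cleaned <- mydata
cleaned[] <- lapply(cleaned, function(col) {
  col[col %in% sentinels] <- NA   # turn placeholder strings into real NA
  col
})

# Rows with any NA (converted or genuine) are removed
cleaned <- cleaned[complete.cases(cleaned), ]
```

Assigning into cleaned[] keeps the object a data frame while replacing each column in place.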

  • 2020-11-21 06:17

    My guess is that this could be more elegantly solved in this way:

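      m <- matrix(1:25, ncol = 5)
      m[c(1, 6, 13, 25)] <- NA
      df <- data.frame(m)
      library(dplyr)
      # Show the rows that contain at least one NA
      df %>%
        filter_all(any_vars(is.na(.)))
      #>   X1 X2 X3 X4 X5
      #> 1 NA NA 11 16 21
      #> 2  3  8 NA 18 23
      #> 3  5 10 15 20 NA
      # To drop those rows instead, keep rows where every column is non-NA
      df %>%
        filter_all(all_vars(!is.na(.)))
      #>   X1 X2 X3 X4 X5
      #> 1  2  7 12 17 22
      #> 2  4  9 14 19 24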
    