Remove rows with all or some NAs (missing values) in data.frame

日久生厌 2020-11-21 05:49

I'd like to remove the lines in this data frame that:

a) contain NAs across all columns. Below is my example data frame.



        
16 answers
  •  灰色年华
    2020-11-21 06:13

    If performance is a priority, use data.table and na.omit() with the optional cols= parameter.

    na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).
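As a minimal sketch of the two data.table calls recommended above, assuming the data.table package is installed. The toy table and its column names (id, x, y) are hypothetical and only there for illustration.

```r
library(data.table)

# Hypothetical toy table: 'id' is always present, 'x' and 'y' have gaps.
dt <- data.table(
  id = c("a", "b", "c", "d"),
  x  = c(1, NA, 3, 4),
  y  = c(NA, 2, 3, NA)
)

# Drop every row that has an NA in any column.
any_na_dropped <- na.omit(dt)

# Drop only the rows that have an NA in the selected column(s), here 'x'.
x_na_dropped <- na.omit(dt, cols = "x")
```

cols= also accepts integer column positions, which is how the benchmark script passes col_subset.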

    If you don't want to use data.table, use complete.cases().

    On a vanilla data.frame, complete.cases is faster than na.omit() or dplyr::drop_na(). Notice that na.omit.data.frame does not support cols=.
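For a plain data.frame, the complete.cases() approach might look like the sketch below. It uses base R only. The toy data frame and its column names are hypothetical.

```r
# Hypothetical toy data frame: 'id' is always present, 'x' and 'y' have gaps.
df <- data.frame(
  id = c("a", "b", "c", "d"),
  x  = c(1, NA, 3, NA),
  y  = c(NA, 2, 3, NA),
  stringsAsFactors = FALSE
)

# Keep only rows with no NA in any column.
complete_rows <- df[complete.cases(df), ]

# Keep rows with no NA in selected columns only (here just 'x').
# drop = FALSE keeps the single-column subset a data.frame.
x_complete <- df[complete.cases(df[, "x", drop = FALSE]), ]
```

Because na.omit() on a data.frame has no cols= argument, subsetting the columns passed to complete.cases() is the way to restrict the check to part of the table.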

    Benchmark result

    Here is a comparison of base (blue), dplyr (pink), and data.table (yellow) methods for dropping either all or select missing observations, on a notional dataset of 1 million observations of 20 numeric variables, each value independently missing with 5% probability, and a subset of 4 variables for part 2.

    Your results may vary based on length, width, and sparsity of your particular dataset.

    Note the log scale on the y-axis.

    Benchmark script

    #-------  Adjust these assumptions for your own use case  ------------
    row_size   <- 1e6L 
    col_size   <- 20    # not including ID column
    p_missing  <- 0.05   # likelihood of missing observation (except ID col)
    col_subset <- 18:21  # second part of question: filter on select columns
    
    #-------  System info for benchmark  ----------------------------------
    R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
    library(data.table); packageVersion('data.table') # 1.10.4.3
    library(dplyr);      packageVersion('dplyr')      # 0.7.4
    library(tidyr);      packageVersion('tidyr')      # 0.8.0
    library(microbenchmark)
    
    #-------  Example dataset using above assumptions  --------------------
    fakeData <- function(m, n, p){
      set.seed(123)
      m <-  matrix(runif(m*n), nrow=m, ncol=n)
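      m[m < p] <- NA   # blank out the cells whose uniform draw falls below p
      return(m)
    }
    # ID column plus col_size numeric columns with missing values
    df <- cbind(
      data.frame(id = paste0('ID', seq(row_size)), stringsAsFactors = FALSE),
      data.frame(fakeData(row_size, col_size, p_missing))
    )
    dt <- data.table(df)
    
    boxplot(
      microbenchmark(
        df[complete.cases(df), ],
        na.omit(df),
        df %>% drop_na,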
        dt[complete.cases(dt), ],
        na.omit(dt)
      ), xlab='', 
      main = 'Performance: Drop any NA observation',
      col=c(rep('lightblue',2),'salmon',rep('beige',2))
    )
    boxplot(
      microbenchmark(
        df[complete.cases(df[,col_subset]), ],
        #na.omit(df), # col subset not supported in na.omit.data.frame
        df %>% drop_na(col_subset),
        dt[complete.cases(dt[,col_subset,with=FALSE]), ],
        na.omit(dt, cols=col_subset) # see ?na.omit.data.table
      ), xlab='', 
      main = 'Performance: Drop NA obs. in select cols',
      col=c('lightblue','salmon',rep('beige',2))
    )
    
