I\'d like to remove the lines in this data frame that:
a) contain NA
s across all columns. Below is my example data frame.
data.table
and na.omit()
with optional param cols=
.na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).
data.table
, use complete.cases()
.On a vanilla data.frame
, complete.cases is faster than na.omit() or dplyr::drop_na(). Notice that na.omit.data.frame
does not support cols=
.
Here is a comparison of base (blue), dplyr
(pink), and data.table
(yellow) methods for dropping either all or select missing observations, on notional dataset of 1 million observations of 20 numeric variables with independent 5% likelihood of being missing, and a subset of 4 variables for part 2.
Your results may vary based on length, width, and sparsity of your particular dataset.
Note log scale on y axis.
#------- Adjust these assumptions for your own use case ------------
row_size <- 1e6L
col_size <- 20 # not including ID column
p_missing <- 0.05 # likelihood of missing observation (except ID col)
col_subset <- 18:21 # second part of question: filter on select columns
#------- System info for benchmark ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr); packageVersion('dplyr') # 0.7.4
library(tidyr); packageVersion('tidyr') # 0.8.0
library(microbenchmark)
#------- Example dataset using above assumptions --------------------
fakeData <- function(m, n, p){
set.seed(123)
m <- matrix(runif(m*n), nrow=m, ncol=n)
m[m% drop_na,
dt[complete.cases(dt), ],
na.omit(dt)
), xlab='',
main = 'Performance: Drop any NA observation',
col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
microbenchmark(
df[complete.cases(df[,col_subset]), ],
#na.omit(df), # col subset not supported in na.omit.data.frame
df %>% drop_na(col_subset),
dt[complete.cases(dt[,col_subset,with=FALSE]), ],
na.omit(dt, cols=col_subset) # see ?na.omit.data.table
), xlab='',
main = 'Performance: Drop NA obs. in select cols',
col=c('lightblue','salmon',rep('beige',2))
)