I\'d like to remove the lines in this data frame that:
a) contain NA
s across all columns. Below is my example data frame.
Assuming dat
as your dataframe, the expected output can be achieved using
1.rowSums
> dat[!rowSums((is.na(dat))),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
2.lapply
> dat[!Reduce('|',lapply(dat,is.na)),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
For your first question, I have a code that I am comfortable with to get rid of all NAs. Thanks for @Gregor to make it simpler.
final[!(rowSums(is.na(final))),]
For the second question, the code is just an alternation from the previous solution.
final[as.logical((rowSums(is.na(final))-5)),]
Notice the -5 is the number of columns in your data. This will eliminate rows with all NAs, since the rowSums adds up to 5 and they become zeroes after subtraction. This time, as.logical is necessary.
data.table
and na.omit()
with optional param cols=
.na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).
data.table
, use complete.cases()
.On a vanilla data.frame
, complete.cases is faster than na.omit() or dplyr::drop_na(). Notice that na.omit.data.frame
does not support cols=
.
Here is a comparison of base (blue), dplyr
(pink), and data.table
(yellow) methods for dropping either all or select missing observations, on notional dataset of 1 million observations of 20 numeric variables with independent 5% likelihood of being missing, and a subset of 4 variables for part 2.
Your results may vary based on length, width, and sparsity of your particular dataset.
Note log scale on y axis.
#------- Adjust these assumptions for your own use case ------------
row_size <- 1e6L
col_size <- 20 # not including ID column
p_missing <- 0.05 # likelihood of missing observation (except ID col)
col_subset <- 18:21 # second part of question: filter on select columns
#------- System info for benchmark ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr); packageVersion('dplyr') # 0.7.4
library(tidyr); packageVersion('tidyr') # 0.8.0
library(microbenchmark)
#------- Example dataset using above assumptions --------------------
fakeData <- function(m, n, p){
set.seed(123)
m <- matrix(runif(m*n), nrow=m, ncol=n)
m[m<p] <- NA
return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)),
stringsAsFactors = FALSE),
data.frame(fakeData(row_size, col_size, p_missing) )
)
dt <- data.table(df)
par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
microbenchmark(
df[complete.cases(df), ],
na.omit(df),
df %>% drop_na,
dt[complete.cases(dt), ],
na.omit(dt)
), xlab='',
main = 'Performance: Drop any NA observation',
col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
microbenchmark(
df[complete.cases(df[,col_subset]), ],
#na.omit(df), # col subset not supported in na.omit.data.frame
df %>% drop_na(col_subset),
dt[complete.cases(dt[,col_subset,with=FALSE]), ],
na.omit(dt, cols=col_subset) # see ?na.omit.data.table
), xlab='',
main = 'Performance: Drop NA obs. in select cols',
col=c('lightblue','salmon',rep('beige',2))
)
One approach that's both general and yields fairly-readable code is to use the filter()
function and the across()
helper functions from the {dplyr} package.
library(dplyr)
vars_to_check <- c("rnor", "cfam")
# Filter a specific list of columns to keep only non-missing entries
df %>%
filter(across(one_of(vars_to_check),
~ !is.na(.x)))
# Filter all the columns to exclude NA
df %>%
filter(across(everything(),
~ !is.na(.)))
# Filter only numeric columns
df %>%
filter(across(where(is.numeric),
~ !is.na(.)))
Similarly, there are also the variant functions in the dplyr package (filter_all
, filter_at
, filter_if
) which accomplish the same thing:
library(dplyr)
vars_to_check <- c("rnor", "cfam")
# Filter a specific list of columns to keep only non-missing entries
df %>%
filter_at(.vars = vars(one_of(vars_to_check)),
~ !is.na(.))
# Filter all the columns to exclude NA
df %>%
filter_all(~ !is.na(.))
# Filter only numeric columns
df %>%
filter_if(is.numeric,
~ !is.na(.))
delete.dirt <- function(DF, dart=c('NA')) {
dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
DF <- DF[dirty_rows, ]
}
mydata <- delete.dirt(mydata)
Above function deletes all the rows from the data frame that has 'NA' in any column and returns the resultant data. If you want to check for multiple values like NA
and ?
change dart=c('NA')
in function param to dart=c('NA', '?')
My guess is that this could be more elegantly solved in this way:
m <- matrix(1:25, ncol = 5)
m[c(1, 6, 13, 25)] <- NA
df <- data.frame(m)
library(dplyr)
df %>%
filter_all(any_vars(is.na(.)))
#> X1 X2 X3 X4 X5
#> 1 NA NA 11 16 21
#> 2 3 8 NA 18 23
#> 3 5 10 15 20 NA