Remove rows with all or some NAs (missing values) in data.frame

问题

I\'d like to remove the lines in this data frame that:

a) contain NAs across all columns. Below is my example data frame.

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Basically, I\'d like to get a data frame such as the following.

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

b) contain NAs in only some columns, so I can also get this result:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

回答1:

Also check complete.cases :

> final[complete.cases(final), ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

> final[complete.cases(final[ , 5:6]),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

but using complete.cases is quite a lot more clear, and faster.

回答2:

Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).

回答3:

tidyr has a new function drop_na:

library(tidyr)
df %>% drop_na()
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 6 ENSG00000221312    0    1    2    3    2
df %>% drop_na(rnor, cfam)
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 4 ENSG00000207604    0   NA   NA    1    2
# 6 ENSG00000221312    0    1    2    3    2

回答4:

I prefer following way to check whether rows contain any NAs:

row.has.na <- apply(final, 1, function(x){any(is.na(x))})

This returns logical vector with values denoting whether there is any NA in a row. You can use it to see how many rows you'll have to drop:

sum(row.has.na)

and eventually drop them

final.filtered <- final[!row.has.na,]

For filtering rows with certain part of NAs it becomes a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.

回答5:

Another option if you want greater control over how rows are deemed to be invalid is

final <- final[!(is.na(final$rnor)) | !(is.na(rawdata$cfam)),]

Using the above, this:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Becomes:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

...where only row 5 is removed since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.

回答6:

If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results. So they are deleted after a certain threshold. This function will allow you to choose how many NAs the row can have before it's deleted:

delete.na <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n,]
}

By default, it will eliminate all NAs:

delete.na(final)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

Or specify the maximum number of NAs allowed:

delete.na(final, 2)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

回答7:

If performance is a priority, use `data.table` and `na.omit()` with optional param `cols=`.

na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).

If you don't want to use `data.table`, use `complete.cases()`.

On a vanilla data.frame, complete.cases is faster than na.omit() or dplyr::drop_na(). Notice that na.omit.data.frame does not support cols=.

Benchmark result

Here is a comparison of base (blue), dplyr (pink), and data.table (yellow) methods for dropping either all or select missing observations, on notional dataset of 1 million observations of 20 numeric variables with independent 5% likelihood of being missing, and a subset of 4 variables for part 2.

Your results may vary based on length, width, and sparsity of your particular dataset.

Note log scale on y axis.

Benchmark script

#-------  Adjust these assumptions for your own use case  ------------
row_size   <- 1e6L 
col_size   <- 20    # not including ID column
p_missing  <- 0.05   # likelihood of missing observation (except ID col)
col_subset <- 18:21  # second part of question: filter on select columns

#-------  System info for benchmark  ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr);      packageVersion('dplyr')      # 0.7.4
library(tidyr);      packageVersion('tidyr')      # 0.8.0
library(microbenchmark)

#-------  Example dataset using above assumptions  --------------------
fakeData <- function(m, n, p){
  set.seed(123)
  m <-  matrix(runif(m*n), nrow=m, ncol=n)
  m[m<p] <- NA
  return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)), 
                        stringsAsFactors = FALSE),
             data.frame(fakeData(row_size, col_size, p_missing) )
             )
dt <- data.table(df)

par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
  microbenchmark(
    df[complete.cases(df), ],
    na.omit(df),
    df %>% drop_na,
    dt[complete.cases(dt), ],
    na.omit(dt)
  ), xlab='', 
  main = 'Performance: Drop any NA observation',
  col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
  microbenchmark(
    df[complete.cases(df[,col_subset]), ],
    #na.omit(df), # col subset not supported in na.omit.data.frame
    df %>% drop_na(col_subset),
    dt[complete.cases(dt[,col_subset,with=FALSE]), ],
    na.omit(dt, cols=col_subset) # see ?na.omit.data.table
  ), xlab='', 
  main = 'Performance: Drop NA obs. in select cols',
  col=c('lightblue','salmon',rep('beige',2))
)

回答8:

Using dplyr package we can filter NA as follows:

dplyr::filter(df,  !is.na(columnname))

回答9:

This will return the rows that have at least ONE non-NA value.

final[rowSums(is.na(final))<length(final),]

This will return the rows that have at least TWO non-NA value.

final[rowSums(is.na(final))<(length(final)-1),]

回答10:

For your first question, I have a code that I am comfortable with to get rid of all NAs. Thanks for @Gregor to make it simpler.

final[!(rowSums(is.na(final))),]

For the second question, the code is just an alternation from the previous solution.

final[as.logical((rowSums(is.na(final))-5)),]

Notice the -5 is the number of columns in your data. This will eliminate rows with all NAs, since the rowSums adds up to 5 and they become zeroes after subtraction. This time, as.logical is necessary.

回答11:

We can also use the subset function for this.

finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))

This will give only those rows that do not have NA in both mmul and rnor

回答12:

I am a synthesizer:). Here I combined the answers into one function:

#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
    if (!is.null(col)) {
        # R converts a single row/col to a vector if the parameter col has only one col
        # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
        df.temp = df[,col,drop=FALSE]
    } else {
        df.temp = df
    }

    if (length(n)==1){
        if (n==0) {
            # simply call complete.cases which might be faster
            result = df[complete.cases(df.temp),]
        } else {
            # credit: http://stackoverflow.com/a/30461945/2292993
            log <- apply(df.temp, 2, is.na)
            logindex <- apply(log, 1, function(x) sum(x) == n)
            result = df[logindex, ]
        }
    }

    if (length(n)==2){
        min = n[1]; max = n[2]
        log <- apply(df.temp, 2, is.na)
        logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
        result = df[logindex, ]
    }

    return(result)
}

回答13:

Assuming dat as your dataframe, the expected output can be achieved using

1.rowSums

> dat[!rowSums((is.na(dat))),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

2.lapply

> dat[!Reduce('|',lapply(dat,is.na)),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

回答14:

delete.dirt <- function(DF, dart=c('NA')) {
  dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
  DF <- DF[dirty_rows, ]
}

mydata <- delete.dirt(mydata)

Above function deletes all the rows from the data frame that has 'NA' in any column and returns the resultant data. If you want to check for multiple values like NA and ? change dart=c('NA') in function param to dart=c('NA', '?')

回答15:

One approach that's both general and yields fairly-readable code is to use the filter function and its variants in the dplyr package (filter_all, filter_at, filter_if):

library(dplyr)

vars_to_check <- c("rnor", "cfam")

# Filter a specific list of columns to keep only non-missing entries
df %>% 
  filter_at(.vars = vars(one_of(vars_to_check)),
            ~ !is.na(.))

# Filter all the columns to exclude NA
df %>% 
  filter_all(~ !is.na(.))

# Filter only numeric columns
df %>%
  filter_if(is.numeric,
            ~ !is.na(.))

回答16:

My guess is that this could be more elegantly solved in this way

  m <- matrix(1:25, ncol = 5)
  m[c(1, 6, 13, 25)] <- NA
  df <- data.frame(m)
  library(dplyr) 
  df %>%
  filter_all(any_vars(is.na(.)))
  #>   X1 X2 X3 X4 X5
  #> 1 NA NA 11 16 21
  #> 2  3  8 NA 18 23
  #> 3  5 10 15 20 NA

来源：https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame

标签

dataframe

filter

missing-data

r-faq