Remove rows with all or some NAs (missing values) in data.frame

日久生厌 2020-11-21 05:49

I'd like to remove the rows in this data frame that:

a) contain NAs across all columns. Below is my example data frame.



        
16 Answers
  • 2020-11-21 06:19

    I am a synthesizer :). Here I have combined the answers into one function:

    #' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
    #' @param df a data frame
    #' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
    #' \cr default is NULL, search for all columns
    #' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
    #' \cr If a number, the exact number of NAs kept
    #' \cr Range includes both ends 3<=n<=5
    #' \cr Range could be -Inf, Inf
    #' @return a new df containing only the rows whose NA count matches n (with the default n=0, rows without any NA)
    #' @export
    ez.na.keep = function(df, col=NULL, n=0){
        if (!is.null(col)) {
            # R converts a single row/col to a vector if the parameter col has only one col
            # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
            df.temp = df[,col,drop=FALSE]
        } else {
            df.temp = df
        }
    
        if (length(n)==1){
            if (n==0) {
                # simply call complete.cases which might be faster
                result = df[complete.cases(df.temp),]
            } else {
                # credit: http://stackoverflow.com/a/30461945/2292993
                # rowSums counts the NAs per row and, unlike nested apply(), also works for a one-row df
                logindex <- rowSums(is.na(df.temp)) == n
                result = df[logindex, ]
            }
        }
    
        if (length(n)==2){
            # count the NAs per row, then keep rows whose count lies in [n[1], n[2]]
            na.count <- rowSums(is.na(df.temp))
            logindex <- na.count >= n[1] & na.count <= n[2]
            result = df[logindex, ]
        }
    
        return(result)
    }
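
The same idea can be sketched more compactly by counting NAs per row once with `rowSums`. This is only an illustration, not the answer's own code: the function name `keep_by_na_count`, its `cols` argument, and the rebuilt `final` data frame (copied from the example table in this thread) are assumptions made for the example.

```r
# Hypothetical compact variant: count NAs per row, then keep the rows whose
# count equals n (single value) or lies within [n[1], n[2]] (range).
keep_by_na_count <- function(df, cols = NULL, n = 0) {
  stopifnot(length(n) %in% c(1, 2))
  # drop = FALSE keeps a data frame even when a single column is selected
  checked <- if (is.null(cols)) df else df[, cols, drop = FALSE]
  na_count <- rowSums(is.na(checked))
  keep <- if (length(n) == 1) na_count == n else na_count >= n[1] & na_count <= n[2]
  df[keep, , drop = FALSE]
}

# Example data rebuilt from the table shown in this thread
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, 2, 1, NA, 3),
  cfam = c(2, 2, NA, 2, NA, 2)
)

complete_rows <- keep_by_na_count(final)                  # rows without any NA
three_na_rows <- keep_by_na_count(final, n = 3)           # rows with exactly 3 NAs
some_na_rows  <- keep_by_na_count(final, n = c(1, Inf))   # rows with at least one NA
rnor_present  <- keep_by_na_count(final, cols = "rnor")   # rows where rnor is not NA
```

Calling `stopifnot` up front makes an unsupported `n` (for example a vector of length 3) fail immediately instead of reaching an undefined result.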
    
  • 2020-11-21 06:22

    Another option, if you want greater control over which rows are deemed invalid, is:

    final <- final[!(is.na(final$rnor)) | !(is.na(final$cfam)),]
    

    Using the above, this:

                 gene hsap mmul mmus rnor cfam
    1 ENSG00000208234    0   NA   NA   NA    2
    2 ENSG00000199674    0    2    2    2    2
    3 ENSG00000221622    0   NA   NA    2   NA
    4 ENSG00000207604    0   NA   NA    1    2
    5 ENSG00000207431    0   NA   NA   NA   NA
    6 ENSG00000221312    0    1    2    3    2
    

    Becomes:

                 gene hsap mmul mmus rnor cfam
    1 ENSG00000208234    0   NA   NA   NA    2
    2 ENSG00000199674    0    2    2    2    2
    3 ENSG00000221622    0   NA   NA    2   NA
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2
    

    ...where only row 5 is removed since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.
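
As a rough illustration of changing the boolean logic, the sketch below rebuilds the example table as `final` and compares the OR condition with an AND condition. The variable names `either_present` and `both_present` are made up for this example.

```r
# Example data rebuilt from the table shown above
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, 2, 1, NA, 3),
  cfam = c(2, 2, NA, 2, NA, 2)
)

# OR: keep a row if rnor or cfam is present (drops only rows where both are NA)
either_present <- final[!is.na(final$rnor) | !is.na(final$cfam), ]

# AND: keep a row only if rnor and cfam are both present
both_present <- final[!is.na(final$rnor) & !is.na(final$cfam), ]

# The AND case can also be written with complete.cases() on the two columns
both_present_cc <- final[complete.cases(final[, c("rnor", "cfam")]), ]
```

With this data the OR version keeps rows 1, 2, 3, 4 and 6, while the AND version keeps only rows 2, 4 and 6.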

  • 2020-11-21 06:24

    This will return the rows that have at least ONE non-NA value.

    final[rowSums(is.na(final))<length(final),]
    

    This will return the rows that have at least TWO non-NA values.

    final[rowSums(is.na(final))<(length(final)-1),]
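
A small worked example may make the thresholds easier to follow. The toy data frame `toy` below is invented for illustration. For a data frame, `length()` returns the number of columns, so `ncol()` is used here as the more explicit equivalent.

```r
# Toy data (hypothetical): row 2 is entirely NA, row 1 has a single non-NA value
toy <- data.frame(
  x = c(1, NA, NA, 4),
  y = c(NA, NA, 2, 5),
  z = c(NA, NA, 3, 6)
)

na_per_row <- rowSums(is.na(toy))   # number of NAs in each row

# At least ONE non-NA value: fewer NAs than there are columns
at_least_one <- toy[na_per_row < ncol(toy), ]

# At least TWO non-NA values: at most ncol - 2 NAs
at_least_two <- toy[na_per_row < ncol(toy) - 1, ]
```

Here `at_least_one` drops only the all-NA row 2, and `at_least_two` additionally drops row 1.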
    
  • 2020-11-21 06:25

    If you want control over how many NAs are acceptable in each row, try this function. In many survey data sets, too many blank question responses can ruin the results, so rows are deleted once they exceed a certain threshold. This function lets you choose how many NAs a row can have before it is deleted:

    delete.na <- function(DF, n=0) {
      DF[rowSums(is.na(DF)) <= n,]
    }
    

    By default, it removes every row that contains any NA:

    delete.na(final)
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    6 ENSG00000221312    0    1    2    3    2
    

    Or specify the maximum number of NAs allowed:

    delete.na(final, 2)
                 gene hsap mmul mmus rnor cfam
    2 ENSG00000199674    0    2    2    2    2
    4 ENSG00000207604    0   NA   NA    1    2
    6 ENSG00000221312    0    1    2    3    2
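
For survey-style data it can be more convenient to express the threshold as a share of unanswered questions rather than an absolute count. The following is a sketch of that variation, not part of the answer above: the function name `delete.na.frac`, its `max_frac` argument, and the `survey` data are assumptions made for the example.

```r
# Hypothetical variant of delete.na: keep rows whose fraction of NAs is at most max_frac
delete.na.frac <- function(DF, max_frac = 0) {
  # rowMeans of a logical matrix gives the share of NA cells in each row
  DF[rowMeans(is.na(DF)) <= max_frac, , drop = FALSE]
}

# Made-up survey responses: four questions, four respondents
survey <- data.frame(
  q1 = c(1, NA, 3, NA),
  q2 = c(2, NA, NA, NA),
  q3 = c(3, 1, 2, NA),
  q4 = c(4, 2, 1, NA)
)

strict  <- delete.na.frac(survey)        # only fully answered rows
lenient <- delete.na.frac(survey, 0.5)   # rows with at most half of the answers missing
```

With this data `strict` keeps only respondent 1, while `lenient` keeps respondents 1 to 3 and drops the fully blank row 4.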
    