Fastest way to drop rows with missing values?

前端 未结 4 439
小鲜肉
小鲜肉 2021-01-02 20:23

I\'m working with a large dataset x. I want to drop rows of x that are missing in one or more columns in a set of columns of x, that

相关标签:
4条回答
  • 2021-01-02 21:08

    Two more approaches

    two vector scans

    x[!is.na(var1) & !is.na(var2)]
    

    join with unique combinations of non-NA values

    If you know the possible unique values in advance, this will be the fastest

    system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
    

    Some timings

    x <-data.table(var1 = sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
                    var2= sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
                    key = c('var1','var2'))
    
    system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
       user  system elapsed 
       0.09    0.02    0.11 
    
     system.time(x[!is.na(var1) & !is.na(var2)])
       user  system elapsed 
       0.06    0.02    0.07 
    
    
     system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
       user  system elapsed 
       0.03    0.00    0.04 
    
    0 讨论(0)
  • 2021-01-02 21:12

    For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :

    keep = rep(TRUE,nrow(x))
    for (j in varcols) keep[is.na(x[[j]])] = FALSE
    x[keep]
    

    The issue with is.na is that it creates a new logical vector to hold its result, which then must be looped through by R to find the TRUEs so it knows which of the keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for that result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.

    1. is.na(x[, ..varcols])
    This is ok but creates a large copy to hold the logical matrix as large as length(varcols). And the ==0 on the result of rowSums will need a new vector, too.

    2. !is.na(var1) & !is.na(var2)
    Ok too, but ! will create a new vector again and so will &. Each of the results of is.na have to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.

    3. CJ(c(0,1),c(0,1))
    Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.

    So, the very fastest (I guess), would be a C version like this (pseudo-code) :

    keep = rep(TRUE,nrow(x))
    for (j=0; j<varcols; j++)
       for (i=0; i<nrow(x); i++)
           if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
    x[keep]
    

    That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.

    0 讨论(0)
  • 2021-01-02 21:14

    Here is a revised version of a c++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to c so I am sure that someone might still be able to improve this.

    After library("RcppArmadillo") you should be able to run the whole file including the benchmark using sourceCpp('cleanmat.cpp'). The c++-file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the matrix without the columns with missing values. keep just takes one argument X and returns a logical vector.

    Note about passing data.table objects: These functions do not accept a data.table as an argument. The functions have to be modified to take DataFrame as an argument (see here.

    cleanmat.cpp

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    
    using namespace Rcpp;
    using namespace arma;
    
    
    // [[Rcpp::export]]
    mat cleanmat(mat X, uvec idx) {
        // remove colums
        X = X.cols(idx - 1);
        // get dimensions
        int n = X.n_rows,k = X.n_cols;
        // create keep vector
        vec keep = ones<vec>(n);
        for (int j = 0; j < k; j++) 
            for (int i = 0; i < n; i++) 
                if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
        // alternative with view for each row (slightly slower)
        /*vec keep = zeros<vec>(n);
        for (int i = 0; i < n; i++) {
             keep(i) = is_finite(X.row(i));
        }*/  
        return (X.rows(find(keep==1)));
    }
    
    
    // [[Rcpp::export]]
    LogicalVector keep(NumericMatrix X) {
        int n = X.nrow(), k = X.ncol();
        // create keep vector
        LogicalVector keep(n, true);
        for (int j = 0; j < k; j++) 
            for (int i = 0; i < n; i++) 
                if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
    
        return (keep);
    }
    
    
    /*** R
    require("Rcpp")
    require("RcppArmadillo")
    require("data.table")
    require("microbenchmark")
    
    # create matrix
    X = matrix(rnorm(1e+07),ncol=100)
    X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
    colnames(X)=paste("c",1:ncol(X),sep="")
    
    idx=sample(ncol(X),90)
    microbenchmark(
      X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
      X[rowSums(is.na(X[,idx])) == 0, idx],
      cleanmat(X,idx),
      X[keep(X[,idx]),idx],
    times=3)
    
    # output
    # Unit: milliseconds
    #                                                     expr       min        lq    median        uq       max
    # 1                                       cleanmat(X, idx)  253.2596  259.7738  266.2880  272.0900  277.8921
    # 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
    # 3                                 X[keep(X[, idx]), idx]  360.8254  361.5165  362.2077  371.2061  380.2045
    # 4                  X[rowSums(is.na(X[, idx])) == 0, idx]  358.4772  367.5698  376.6625  379.6093  382.5561
    
    */
    
    0 讨论(0)
  • 2021-01-02 21:17

    This should be faster than using apply:

    x[rowSums(is.na(x[, ..varcols])) == 0, ]
    #    var1 var2 textcol
    # 1:    0    0       e
    # 2:    0    1       f
    # 3:    1    0       h
    # 4:    1    1       i
    
    0 讨论(0)
提交回复
热议问题