Fastest way to drop rows with missing values?

小鲜肉 2021-01-02 20:23

I'm working with a large dataset x. I want to drop rows of x that are missing in one or more columns in a set of columns of x, that

4 answers
  •  执笔经年
    2021-01-02 21:14

    Here is a revised version of a C++ solution, with a number of modifications based on a long discussion with Matthew (see comments below). I am new to C++, so I am sure that someone might still be able to improve this.

    After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file contains two functions. cleanmat takes two arguments (X and the index of the columns) and returns the selected columns of X without the rows that contain missing values. keep takes a single argument X and returns a logical vector that is TRUE for the rows without missing values.

    Note about passing data.table objects: these functions do not accept a data.table as an argument. They would have to be modified to take a DataFrame as an argument (see here).

    cleanmat.cpp

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    
    using namespace Rcpp;
    using namespace arma;
    
    
    // [[Rcpp::export]]
    mat cleanmat(mat X, uvec idx) {
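        // keep only the selected columns (idx is 1-based, as passed from R)
        X = X.cols(idx - 1);
        // get dimensions
        int n = X.n_rows, k = X.n_cols;
        // keep[i] stays 1 until a non-finite value (NA, NaN or Inf) is found in row i
        vec keep = ones(n);
        // loop over columns first, because the matrix is stored column-major
        for (int j = 0; j < k; j++)
            for (int i = 0; i < n; i++)
                if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;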
        // alternative with view for each row (slightly slower)
        /*vec keep = zeros(n);
        for (int i = 0; i < n; i++) {
             keep(i) = is_finite(X.row(i));
        }*/  
        return (X.rows(find(keep==1)));
    }
    
    
    // [[Rcpp::export]]
    LogicalVector keep(NumericMatrix X) {
        int n = X.nrow(), k = X.ncol();
        // create keep vector
        LogicalVector keep(n, true);
        for (int j = 0; j < k; j++) 
            for (int i = 0; i < n; i++) 
                if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
    
        return (keep);
    }
    
    
    /*** R
    require("Rcpp")
    require("RcppArmadillo")
    require("data.table")
    require("microbenchmark")
    
    # create matrix
    X = matrix(rnorm(1e+07),ncol=100)
    X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
    colnames(X)=paste("c",1:ncol(X),sep="")
    
    idx=sample(ncol(X),90)
    microbenchmark(
      X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
      X[rowSums(is.na(X[,idx])) == 0, idx],
      cleanmat(X,idx),
      X[keep(X[,idx]),idx],
    times=3)
    
    # output
    # Unit: milliseconds
    #                                                     expr       min        lq    median        uq       max
    # 1                                       cleanmat(X, idx)  253.2596  259.7738  266.2880  272.0900  277.8921
    # 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
    # 3                                 X[keep(X[, idx]), idx]  360.8254  361.5165  362.2077  371.2061  380.2045
    # 4                  X[rowSums(is.na(X[, idx])) == 0, idx]  358.4772  367.5698  376.6625  379.6093  382.5561
    
    */
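
The row scan used by keep above can also be tried without R. The following is only a minimal sketch in standard C++, not part of cleanmat.cpp: the function name complete_rows and the flat column-major std::vector layout are assumptions made for illustration, and std::isnan stands in for NumericVector::is_na.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from cleanmat.cpp): returns one flag per row,
// true if the row contains no NaN in any column.
// `data` holds an n_rows x n_cols matrix in column-major order, so
// element (i, j) lives at data[j * n_rows + i].
std::vector<bool> complete_rows(const std::vector<double>& data,
                                std::size_t n_rows, std::size_t n_cols) {
    std::vector<bool> keep(n_rows, true);
    for (std::size_t j = 0; j < n_cols; ++j) {        // outer loop over columns
        for (std::size_t i = 0; i < n_rows; ++i) {    // walks memory contiguously
            // skip the NaN check for rows that are already marked as dropped
            if (keep[i] && std::isnan(data[j * n_rows + i])) {
                keep[i] = false;
            }
        }
    }
    return keep;
}
```

Looping over columns in the outer loop keeps the memory access sequential for a column-major matrix, and the keep[i] test avoids re-checking rows that have already been ruled out.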
    
