Quickly remove zero variance variables from a data.frame

2020-12-13 01:07

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). How can I quickly remove these zero-variance variables from the data.frame?

8 answers
  • 2020-12-13 01:13

    I think having zero variance is equivalent to being constant, so one can get by without doing any arithmetic at all. I would expect range() to outperform var(), but I have not verified this:

    removeConstantColumns <- function(a_dataframe, verbose=FALSE) {
      # a column is constant when its range collapses to a single value
      notConstant <- function(x) {
        if (is.factor(x)) x <- as.integer(x)
        return (0 != diff(range(x, na.rm=TRUE)))
      }
      bkeep <- sapply(a_dataframe, notConstant)
      if (verbose) {
        cat('removeConstantColumns: '
          , ifelse(all(bkeep)
            , 'nothing'
            , paste(names(a_dataframe)[!bkeep], collapse=','))
          , ' removed', '\n')
      }
      # drop = FALSE keeps a data.frame even if only one column remains
      return (a_dataframe[, bkeep, drop=FALSE])
    }
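
    A quick usage sketch (the data frame below is made up for illustration and is not part of the original answer):

    # column a varies, b and f are constant and should be dropped
    df <- data.frame(a = 1:5, b = rep(2, 5), f = factor(rep("x", 5)))
    str(removeConstantColumns(df, verbose = TRUE))
    # only "a" is kept; "b,f" are reported as removed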
    
  • 2020-12-13 01:24

    Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like

    var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))
    

    It will be TRUE for zero-variance columns, NA for columns containing NAs, and FALSE for non-zero variance.
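
    One way to drop the flagged columns while treating the NA results as "keep" (a sketch, not from the original answer):

    keep <- !(var0 %in% TRUE)          # NA and FALSE both count as "keep"
    df_reduced <- df[, keep, drop = FALSE]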

  • 2020-12-13 01:24

    Use the caret package and the function nearZeroVar:

    require(caret)
    NZV<- nearZeroVar(dataset, saveMetrics = TRUE)
    NZV[NZV[,"zeroVar"] > 0, ] 
    NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]
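
    To actually remove the strictly constant columns, the logical zeroVar column of the metrics above can be turned into column positions (a sketch; the empty case needs a guard so that nothing is dropped when no column is constant):

    zv_cols <- which(NZV$zeroVar)      # positions of zero-variance columns
    if (length(zv_cols) > 0) dataset <- dataset[, -zv_cols, drop = FALSE]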
    
  • 2020-12-13 01:24

    How about using factor to count the number of unique values, looping over the columns with sapply:

    dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
       B  D F
    1  3 10 I
    2  4 10 J
    3  6 10 I
    4  9 10 J
    5  2 10 I
    6  9 10 J
    7  9 10 I
    8  7 10 J
    9  6 10 I
    10 1  1 J
    

    NAs are excluded by default, but this can be changed with the exclude parameter of factor:

    dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
       B  D F  G
    1  3 10 I 10
    2  4 10 J 10
    3  6 10 I 10
    4  9 10 J 10
    5  2 10 I 10
    6  9 10 J 10
    7  9 10 I 10
    8  7 10 J 10
    9  6 10 I 10
    10 1  1 J NA
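    The same check can be wrapped into a small helper (a sketch; dropConstantCols and its count_na_as_level argument are hypothetical names, not from the answer):

    dropConstantCols <- function(dat, count_na_as_level = FALSE) {
      # exclude = NULL makes NA its own factor level, as in the second example
      excl <- if (count_na_as_level) NULL else NA
      keep <- sapply(dat, function(x) length(levels(factor(x, exclude = excl))) > 1)
      dat[, keep, drop = FALSE]
    }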
    
  • 2020-12-13 01:26

    Check out this custom function; it relies on dplyr, purrr, and tidyr, and only considers numeric columns. I did not try it on data frames with 100+ variables.

    library(dplyr)   # %>%, select_if, filter
    library(purrr)   # map_dfr
    library(tidyr)   # gather, spread

    remove_low_variance_cols <- function(df, threshold = 0) {
      n <- Sys.time() # see how long this takes to run
      remove_cols <- df %>%
        select_if(is.numeric) %>%      # only numeric columns are checked
        map_dfr(var) %>%               # one-row data frame of column variances
        gather() %>%
        filter(value <= threshold) %>%
        spread(key, value) %>%
        names()

      if (length(remove_cols)) {
        print("Removing the following columns: ")
        print(remove_cols)
      } else {
        print("There are no low variance columns with this threshold")
      }
      # how long did this script take?
      print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
      return(df[, setdiff(names(df), remove_cols)])
    }
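
    A usage sketch with made-up data, assuming the packages above are installed (column b is constant, column c is ignored because it is not numeric):

    example_df <- data.frame(a = rnorm(10), b = rep(1, 10), c = letters[1:10])
    cleaned <- remove_low_variance_cols(example_df)
    names(cleaned)   # "a" and "c" remain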
    
  • 2020-12-13 01:30

    Don't use table() - very slow for such things. One option is length(unique(x)):

    foo <- function(dat) {
        # number of distinct values per column
        out <- lapply(dat, function(x) length(unique(x)))
        # positions of columns with a single distinct value
        want <- which(!out > 1)
        unlist(want)
    }
    
    system.time(replicate(1000, zeroVar(dat)))
    system.time(replicate(1000, foo(dat)))
    

    Which is an order of magnitude faster than yours on the example data set whilst giving similar output:

    > system.time(replicate(1000, zeroVar(dat)))
       user  system elapsed 
      3.334   0.000   3.335 
    > system.time(replicate(1000, foo(dat)))
       user  system elapsed 
      0.324   0.000   0.324
    

    Simon's solution here is similarly quick on this example:

    > system.time(replicate(1000, which(!unlist(lapply(dat, 
    +             function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
       user  system elapsed 
      0.392   0.000   0.395
    

    but you'll have to see if they scale similarly to real problem sizes.
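
    To actually drop the flagged columns (a sketch; foo() returns column indices, and the empty case needs a guard because negative indexing with an empty vector would drop every column):

    idx <- foo(dat)
    dat_reduced <- if (length(idx)) dat[, -idx, drop = FALSE] else dat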
