Quickly remove zero variance variables from a data.frame

后端未结

关注

 8  733

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the sa

相关标签:

8条回答

死守一世寂寞

2020-12-13 01:13

I think having zero variance is equivalent to being constant and one can get around without doing any arithmetic operations at all. I would expect that range() outperforms var(), but I have not verified this:

removeConstantColumns <- function(a_dataframe, verbose=FALSE) {
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return (0 != diff(range(x, na.rm=TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: '
      , ifelse(all(bkeep)
        , 'nothing'
        , paste(names(a_dataframe)[!bkeep], collapse=',')
      , ' removed',  '\n')
  }
  return (a_dataframe[, bkeep])
}

0 讨论(0)

佛祖请我去吃肉

2020-12-13 01:24
Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like
```
var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))
```
It will be TRUE for 0-variance, NA for columns with NAs and FALSE for non-zero variance
0 讨论(0)
发布评论:

提交评论
- 加载中...

有刺的猬

2020-12-13 01:24

Use the Caret Package and the function nearZeroVar

require(caret)
NZV<- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[,"zeroVar"] > 0, ] 
NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]

0 讨论(0)

余生分开走

2020-12-13 01:24

How about using factor to count the number of unique elements and looping with sapply:

dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

NAs are excluded by default, but this can be changed with the exclude parameter of factor:

dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

0 讨论(0)

旧时难觅i

2020-12-13 01:26

Check this custom function. I did not try it on data frames with 100+ variables.

remove_low_variance_cols <- function(df, threshold = 0) {
  n <- Sys.time() #See how long this takes to run
  remove_cols <- df %>%
    select_if(is.numeric) %>%
    map_dfr(var) %>%
    gather() %>% 
    filter(value <= threshold) %>%
    spread(key, value) %>%
    names()

  if(length(remove_cols)) {
    print("Removing the following columns: ")
    print(remove_cols)
  }else {
    print("There are no low variance columns with this threshold")
  }
  #How long did this script take?
  print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
  return(df[, setdiff(names(df), remove_cols)])
}

0 讨论(0)

旧时难觅i

2020-12-13 01:30

Don't use table() - very slow for such things. One option is length(unique(x)):

foo <- function(dat) {
    out <- lapply(dat, function(x) length(unique(x)))
    want <- which(!out > 1)
    unlist(want)
}

system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))

Which is an order magnitude faster than yours on the example data set whilst giving similar output:

> system.time(replicate(1000, zeroVar(dat)))
   user  system elapsed 
  3.334   0.000   3.335 
> system.time(replicate(1000, foo(dat)))
   user  system elapsed 
  0.324   0.000   0.324

Simon's solution here is similarly quick on this example:

> system.time(replicate(1000, which(!unlist(lapply(dat, 
+             function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
   user  system elapsed 
  0.392   0.000   0.395

but you'll have to see if they scale similarly to real problem sizes.

0 讨论(0)

1 2 下一页