Exclude columns with no variance [duplicate]

This question already has an answer here:

Quickly remove zero variance variables from a data.frame

How can I exclude, from a data matrix of nearly 1,000 variables, the columns with variance equal to 0 (zero), i.e. those where all cases/observations have the same value? I could calculate the variance of each column and then manually list the numbers of the columns to exclude (or to include, as this seems easier to do in R). But surely there is a more elegant and time-saving solution in R. Thank you in advance!

We can use Filter

Filter(var, df1)
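A minimal sketch of what this call does, assuming every column is numeric (the data frame `df1` and its columns below are made up for illustration). Filter() coerces the value returned by var() to logical, so a variance of exactly 0 becomes FALSE and the column is dropped, while any non-zero variance becomes TRUE and the column is kept.

```r
# Hypothetical example data: two varying columns and one constant column.
df1 <- data.frame(a = c(1, 2, 3),
                  b = c(5, 5, 5),
                  c = c(0.1, 0.4, 0.2))

# var() is 0 for the constant column `b`; Filter() treats 0 as FALSE.
kept <- Filter(var, df1)
names(kept)
```

Columns whose variance is NA (for example, because they contain missing values) are dropped as well, which may or may not be what you want.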

The caret package provides some useful functions for this (http://topepo.github.io/caret/preprocess.html#nzv):

nearZeroVar: Identification of near zero variance predictors

nearZeroVar diagnoses predictors that have one unique value (i.e. are zero variance predictors) or predictors that have both of the following characteristics: they have very few unique values relative to the number of samples and the ratio of the frequency of the most common value to the frequency of the second most common value is large. checkConditionalX looks at the distribution of the columns of x conditioned on the levels of y and identifies columns of x that are sparse within groups of y.
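The two criteria in that description can be sketched in base R as follows. This is only an illustration of the idea, not caret's implementation: `near_zero_sketch` and the `example` data are hypothetical, and the thresholds are taken from nearZeroVar's documented defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Hypothetical helper, not part of caret: TRUE if a column looks like a
# zero or near-zero variance predictor under the two criteria above.
near_zero_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # At most one unique value: a zero variance predictor.
  if (length(counts) <= 1) return(TRUE)
  # Frequency of the most common value over that of the second most common.
  freq_ratio <- counts[[1]] / counts[[2]]
  # Number of unique values as a percentage of the number of samples.
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

example <- data.frame(
  constant = rep(0, 100),          # one unique value
  skewed   = c(rep(0, 98), 1, 2),  # few unique values, one dominant
  varied   = seq_len(100)          # all values distinct
)

flags <- vapply(example, near_zero_sketch, logical(1))
example[!flags]                    # keeps only the informative column
```

In practice it is probably better to call nearZeroVar() itself, which can also report these per-column statistics.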

Also note that var() is a slow function. We can use more efficient solutions. A comparison of some of them:

dataset <- data.frame(replicate(10, runif(100)),
                      replicate(10, rep(0, 100)))
microbenchmark::microbenchmark(
    var = Filter(var, dataset),
    var2 = Filter(function(x) sum((x - sum(x) / length(x))^2), dataset),
    range = Filter(function(x) diff(range(x)), dataset),
    range2 = Filter(function(x) max(x) - min(x), dataset))
#> Unit: microseconds
#>    expr     min       lq      mean   median       uq      max neval cld
#>     var 334.058 359.1545 419.89933 418.8425 439.5935 1707.222   100   c
#>    var2  74.457  78.8310  87.47988  87.4805  94.1590  127.932   100 a  
#>   range 219.973 233.8155 256.30933 260.9380 272.0370  306.272   100  b 
#>  range2  72.040  75.7300  84.97079  85.1985  90.8195  108.869   100 a

For factors or integers we can also test length(unique(x)) > 1.
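A short sketch of the unique-value check for non-continuous columns, using base R's unique(). The data frame `df` is made up for illustration. The explicit `> 1` comparison matters: a constant column still has one unique value, and a count of 1 would be coerced to TRUE.

```r
# Hypothetical example data with a constant factor and a constant integer column.
df <- data.frame(
  grp  = factor(c("a", "a", "a")),
  id   = 1:3,
  flag = c(1L, 1L, 1L)
)

# Keep only columns that contain more than one distinct value.
kept <- Filter(function(x) length(unique(x)) > 1, df)
names(kept)
```

Unlike var(), this check also works on factor and character columns.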

About Filter: the expression

Filter(function(x) max(x) - min(x), dataset)

is similar to

dataset[vapply(dataset, function(x) as.logical(max(x) - min(x)), logical(1))]

but it works a little bit slower.
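To check that the two forms select the same columns, here is a small sketch on a reduced version of the benchmark data (the sizes and the seed are arbitrary choices made for this example):

```r
set.seed(1)
dataset <- data.frame(replicate(3, runif(10)),
                      replicate(2, rep(0, 10)))

# Filter() coerces the numeric range width to logical internally.
via_filter <- Filter(function(x) max(x) - min(x), dataset)

# The vapply() form performs the same coercion explicitly.
keep <- vapply(dataset, function(x) as.logical(max(x) - min(x)), logical(1))
via_vapply <- dataset[keep]

# Both approaches should keep the same columns.
identical(names(via_filter), names(via_vapply))
```

The vapply() form also makes the logical mask available as `keep`, which can be handy when the same selection has to be applied to another data set.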

Note that nearZeroVar() is a more complex and flexible solution.

Source: https://stackoverflow.com/questions/34543045/exclude-columns-with-no-variance

Tags

variance