Question
How can I exclude, from a data matrix of nearly 1,000 variables, the variables/columns whose variance equals 0 (i.e. all cases/observations in the column have the same value)? I can imagine calculating the variance of each column and then manually writing out the numbers of the columns to be excluded (or included, as this seems easier to do in R). But surely there is a more elegant and time-saving solution in R. Thank you in advance!
Answer 1:
We can use Filter, which keeps only the columns for which var() returns a non-zero value (0 is coerced to FALSE):
Filter(var, df1)
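As a quick illustration of the one-liner above, here is a minimal sketch on a toy data frame. The data frame and its column names are hypothetical, and all columns are assumed to be numeric, since var() does not accept factors or character vectors.

```r
# Hypothetical toy data: columns 'b' and 'd' are constant (zero variance)
df1 <- data.frame(
  a = c(1, 2, 3),
  b = c(5, 5, 5),
  c = c(0.1, 0.4, 0.2),
  d = c(0, 0, 0)
)

# Filter() applies var() to each column and keeps the columns whose
# result coerces to TRUE, i.e. whose variance is non-zero
kept <- Filter(var, df1)

names(kept)
```

The result is still a data frame, so it can be passed straight on to later modelling steps.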
Answer 2:
The caret package provides some useful functions for this: http://topepo.github.io/caret/preprocess.html#nzv
nearZeroVar: Identification of near zero variance predictors
nearZeroVar diagnoses predictors that have one unique value (i.e. are zero variance predictors) or predictors that have both of the following characteristics: they have very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large. checkConditionalX looks at the distribution of the columns of x conditioned on the levels of y and identifies columns of x that are sparse within groups of y.
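A minimal sketch of how nearZeroVar() might be used to drop only the strictly constant columns. It assumes the caret package is installed; the toy data frame and its column names are hypothetical. With saveMetrics = TRUE the function returns one row per column, including a logical zeroVar flag, which avoids the pitfall of negative indexing with an empty index vector when nothing is flagged.

```r
# Assumes caret is installed: install.packages("caret")
library(caret)

# Hypothetical toy data: 'x2' is constant
df <- data.frame(
  x1 = c(1.2, 3.4, 2.2, 5.1, 4.8),
  x2 = rep(7, 5),
  x3 = c(0, 1, 0, 1, 1)
)

# One row per column: freqRatio, percentUnique, zeroVar, nzv
metrics <- nearZeroVar(df, saveMetrics = TRUE)

# Keep every column that is not flagged as zero variance
df_nonconst <- df[, !metrics$zeroVar, drop = FALSE]

names(df_nonconst)
```

Filtering on metrics$nzv instead would also remove the near-zero-variance columns, subject to the freqCut and uniqueCut thresholds.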
Also note that var() is a slow function. More efficient alternatives exist; here is a comparison of some of them:
# 10 random columns followed by 10 constant (zero-variance) columns
dataset <- data.frame(replicate(10, runif(100)),
                      replicate(10, rep(0, 100)))
# Each variant returns 0 for a constant column, which Filter() treats as FALSE
microbenchmark::microbenchmark(
  var    = Filter(var, dataset),
  var2   = Filter(function(x) sum((x - sum(x) / length(x))^2), dataset),
  range  = Filter(function(x) diff(range(x)), dataset),
  range2 = Filter(function(x) max(x) - min(x), dataset))
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> var 334.058 359.1545 419.89933 418.8425 439.5935 1707.222 100 c
#> var2 74.457 78.8310 87.47988 87.4805 94.1590 127.932 100 a
#> range 219.973 233.8155 256.30933 260.9380 272.0370 306.272 100 b
#> range2 72.040 75.7300 84.97079 85.1985 90.8195 108.869 100 a
We can also use length(unique(x)) for factors or integers.
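For non-numeric columns, a sketch along these lines may help. The data frame is hypothetical, and the comparison > 1 is added here so that the predicate returns a logical value that is FALSE for constant columns.

```r
# Hypothetical toy data mixing factors and integers
df <- data.frame(
  grp  = factor(c("a", "a", "a")),   # constant factor
  id   = 1:3,                        # varying integer
  flag = c(2L, 2L, 2L),              # constant integer
  lvl  = factor(c("x", "y", "x"))    # varying factor
)

# Keep columns that contain more than one distinct value
kept <- Filter(function(x) length(unique(x)) > 1, df)

names(kept)
```

Unlike var(), this predicate also works for factor and character columns.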
About Filter: the expression
Filter(function(x) max(x) - min(x), dataset)
is similar to
dataset[vapply(dataset, function(x) as.logical(max(x) - min(x)), logical(1))]
but it works a little bit slower.
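The equivalence of the two forms can be checked on a small example. This is only a sketch on hypothetical toy data, assuming numeric columns without missing values.

```r
# Hypothetical toy data: 'b' and 'd' are constant
dataset <- data.frame(
  a = c(1, 5, 3),
  b = c(2, 2, 2),
  c = c(0.5, 0.1, 0.9),
  d = c(0, 0, 0)
)

# Form 1: Filter() coerces the numeric range to logical internally
by_filter <- Filter(function(x) max(x) - min(x), dataset)

# Form 2: build the logical mask explicitly, then subset the columns
keep <- vapply(dataset, function(x) as.logical(max(x) - min(x)), logical(1))
by_vapply <- dataset[keep]

# Compare the two results
identical(by_filter, by_vapply)
```

The vapply() form also exposes the logical mask, which is handy when the same selection has to be applied to another data set with the same columns.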
Note that nearZeroVar() is a more complex and flexible solution.
Source: https://stackoverflow.com/questions/34543045/exclude-columns-with-no-variance