Exclude columns with no variance [duplicate]

Submitted by 点点圈 on 2019-12-02 02:48:30

We can use Filter, which keeps only the columns whose variance is non-zero:

Filter(var, df1)
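To make the one-liner concrete, here is a minimal sketch on a small made-up data frame (the contents of df1 are an assumption for illustration, not data from the question):

```r
# Hypothetical example data: column b is constant, so its variance is 0
df1 <- data.frame(a = c(1, 2, 3, 4),
                  b = c(5, 5, 5, 5),
                  c = c(0.1, 0.4, 0.2, 0.9))

# Filter() coerces the result of var() to logical: 0 becomes FALSE
# (column dropped), any non-zero variance becomes TRUE (column kept)
res <- Filter(var, df1)
names(res)
#> [1] "a" "c"
```

This relies on every column being numeric; var() is not meaningful for factors.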

The caret package provides some useful functions for this: http://topepo.github.io/caret/preprocess.html#nzv

nearZeroVar: Identification of near zero variance predictors

nearZeroVar diagnoses predictors that have one unique value (i.e. zero variance predictors) or predictors that have both of the following characteristics: they have very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large. checkConditionalX looks at the distribution of the columns of x conditioned on the levels of y and identifies columns of x that are sparse within groups of y.
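A possible way to use it for dropping columns is sketched below. It assumes caret is installed, and df1 is again made-up example data. With default arguments nearZeroVar() returns the positions of the flagged columns.

```r
# Assumes the caret package is installed; df1 is hypothetical example data
library(caret)

df1 <- data.frame(a = 1:20,
                  b = rep(5, 20),               # constant column: zero variance
                  c = seq(0.5, 10, by = 0.5))

# Positions of columns flagged as zero or near-zero variance
nzv <- nearZeroVar(df1)

# Guard against an empty result: df1[, -integer(0)] would drop every column
res <- if (length(nzv) > 0) df1[, -nzv, drop = FALSE] else df1
names(res)
```

The length check matters because negative indexing with an empty vector selects nothing rather than everything.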

Also note that var() is a slow function, and more efficient alternatives exist. A comparison of some of them:

dataset <- data.frame(replicate(10, runif(100)),
                      replicate(10, rep(0, 100)))
microbenchmark::microbenchmark(
    var = Filter(var, dataset),
    var2 = Filter(function(x) sum((x - sum(x) / length(x))^2), dataset),
    range = Filter(function(x) diff(range(x)), dataset),
    range2 = Filter(function(x) max(x) - min(x), dataset))
#> Unit: microseconds
#>    expr     min       lq      mean   median       uq      max neval cld
#>     var 334.058 359.1545 419.89933 418.8425 439.5935 1707.222   100   c
#>    var2  74.457  78.8310  87.47988  87.4805  94.1590  127.932   100 a  
#>   range 219.973 233.8155 256.30933 260.9380 272.0370  306.272   100  b 
#>  range2  72.040  75.7300  84.97079  85.1985  90.8195  108.869   100 a

For factors or integers we can also count the distinct values with length(unique(x)) and keep the columns that have more than one.
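A minimal sketch of the distinct-value check, on made-up data (the column names are hypothetical). Note the explicit `> 1`: a bare length(unique(x)) is never 0, so Filter would keep every column.

```r
# Hypothetical example data: group is a factor with a single level in use
df1 <- data.frame(id = 1:4,
                  group = factor(c("x", "x", "x", "x")),
                  level = c(2L, 2L, 3L, 2L))

# Keep only the columns with more than one distinct value
res <- Filter(function(x) length(unique(x)) > 1, df1)
names(res)
#> [1] "id"    "level"
```

Unlike var() or max(x) - min(x), this check also works on factor and character columns.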

About Filter: the expression

Filter(function(x) max(x) - min(x), dataset)

is similar to

dataset[vapply(dataset, function(x) as.logical(max(x) - min(x)), logical(1))]

but Filter works a little bit slower.
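The equivalence of the two forms can be checked with a short sketch like the following, using a smaller data set of the same shape as in the benchmark (the sizes here are arbitrary):

```r
# Three random columns followed by two constant (all-zero) columns
dataset <- data.frame(replicate(3, runif(10)),
                      replicate(2, rep(0, 10)))

# Filter() form: a zero range is coerced to FALSE and the column is dropped
a <- Filter(function(x) max(x) - min(x), dataset)

# Explicit form: build the logical mask first, then subset the columns
keep <- vapply(dataset, function(x) as.logical(max(x) - min(x)), logical(1))
b <- dataset[keep]

# Both forms select the same columns
identical(names(a), names(b))
```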

Note that nearZeroVar() is a more complex and flexible solution.
