Equal frequency discretization in R

Anonymous (unverified), submitted 2019-12-03 01:29:01

Question:

I'm having trouble finding a function in R that performs equal-frequency discretization. I stumbled on the 'infotheo' package, but after some testing I found that the algorithm is broken. 'dprep' seems to no longer be supported on CRAN.

EDIT :

For clarity, I do not need to separate the values between the bins. I really want equal frequency; it doesn't matter if one value ends up in two bins. E.g.:

c(1,3,2,1,2,2)  

should give a bin c(1,1,2) and one c(2,2,3)

Answer 1:

EDIT : given your real goal, why don't you just do (corrected) :

 EqualFreq2 

This returns a vector indicating which bin each value belongs to. Because some values may be present in more than one bin, you cannot define bin limits. But you can do:

x 
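As a minimal sketch of the rank-based idea described above (not the answer's own code, which did not survive on this page), the following assigns bin indicators by sorted position, so tied values can be spread over two bins. The helper name `equal_freq_by_rank` is hypothetical.

```r
# Hypothetical helper: assign each value to one of n bins by its sorted position.
equal_freq_by_rank <- function(x, n) {
  bin <- integer(length(x))
  # Position k in sorted order goes to bin ceiling(k * n / N),
  # which spreads the positions as evenly as possible over 1..n.
  bin[order(x)] <- as.integer(ceiling(seq_along(x) * n / length(x)))
  bin
}

x <- c(1, 3, 2, 1, 2, 2)
bin <- equal_freq_by_rank(x, 2)
bins <- split(x, bin)   # list of the values that landed in each bin
bins
```

On the question's example this puts the values 1, 1, 2 in the first bin and 2, 2, 3 in the second, in their original order of appearance.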

Original answer:

You can easily just use cut() for this :

EqualFreq 0 stop("n is too large.")      cut(x,breaks,include.lowest=include.lowest,...)  } 

Which gives :

set.seed(12345) x 

As you see, for discrete data an optimal equal binning is rather impossible in most cases, but this method gives you the best possible binning available.
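For illustration, here is one way a cut()-based equal-frequency function could look. This is a sketch under assumptions: the name `equal_freq_cut` is hypothetical, and the breaks are taken from quantile(), which is not necessarily how the answer computed them. Only the duplicate-break check and the final cut() call mirror the fragment shown above.

```r
# Hypothetical helper: equal-frequency bins via quantile breaks (an assumption).
equal_freq_cut <- function(x, n, include.lowest = TRUE, ...) {
  # n + 1 break points at evenly spaced probabilities
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), names = FALSE)
  # Repeated breaks mean the data cannot be split into n distinct intervals
  if (anyDuplicated(breaks) > 0) stop("n is too large.")
  cut(x, breaks, include.lowest = include.lowest, ...)
}

set.seed(12345)
x <- rnorm(50)
counts <- table(equal_freq_cut(x, 5))
counts
```

With continuous data such as rnorm() the bins come out equally populated. With heavily tied discrete data the breaks can coincide, which is what the stop() call guards against.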



Answer 2:

This sort of thing is also quite easily solved by using (abusing?) the conditioning plot infrastructure from lattice, in particular function co.intervals():

cutEqual 

Which reproduces @Joris' excellent answer:

> set.seed(12345)
> x  table(cutEqual(x, 5))

 [-2.38,-0.885] (-0.885,-0.115]  (-0.115,0.587]   (0.587,0.938]     (0.938,2.2]
             10              10              10              10              10
> y  table(cutEqual(y, 5))

 [0.5,3.5]  (3.5,5.5]  (5.5,6.5]  (6.5,7.5] (7.5,11.5]
        10         13         11          6         10

In the latter, discrete, case the breaks are different although they have the same effect; the same observations are in the same bins.
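A sketch of how co.intervals() can feed cut() is given below. The helper name `cut_equal_sketch` is hypothetical and this is not necessarily the answer's implementation. co.intervals() is called with an explicit `graphics::` prefix because it is documented alongside coplot(); with `overlap = 0` it returns one non-overlapping interval per row.

```r
# Hypothetical helper: derive cut() breaks from co.intervals() (a sketch).
cut_equal_sketch <- function(x, n, include.lowest = TRUE, ...) {
  # One row per interval: column 1 is the lower end, column 2 the upper end
  iv <- graphics::co.intervals(x, number = n, overlap = 0)
  # Use the lower end of the first interval, then the upper end of every interval
  breaks <- c(iv[1, 1], iv[, 2])
  cut(x, breaks, include.lowest = include.lowest, ...)
}

set.seed(12345)
x <- rnorm(50)
counts <- table(cut_equal_sketch(x, 5))
counts
```

Because co.intervals() pads each interval by half the smallest gap between distinct values, the breaks fall between observations rather than on them.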



Answer 3:

How about?

a  table(Hmisc::cut2(a, m = 10))

[-2.2020,-0.7710) [-0.7710,-0.2352) [-0.2352, 0.0997) [ 0.0997, 0.9775)
               10                10                10                10
[ 0.9775, 2.5677]
               10


Answer 4:

Here is a function that handles the error "'breaks' are not unique" and automatically selects the closest workable n_bins value to the one you requested.

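library(ggplot2)  # provides cut_number()

equal_freq <- function(var, n_bins) {
  n_bins_orig <- n_bins
  res <- tryCatch(cut_number(var, n = n_bins), error = function(e) e)
  # On failure, retry with one bin fewer until cut_number() succeeds
  while (inherits(res, "error") && n_bins > 1) {
    n_bins <- n_bins - 1
    res <- tryCatch(cut_number(var, n = n_bins), error = function(e) e)
  }
  if (n_bins_orig != n_bins)
    warning(sprintf("It's not possible to calculate with n_bins=%s, setting n_bins in: %s.", n_bins_orig, n_bins))
  res
}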

Example:

equal_freq(mtcars$carb, 10) 

This returns the binned variable along with the following warning:

It's not possible to calculate with n_bins=10, setting n_bins in: 5. 


Answer 5:

Here is a one liner solution inspired by @Joris' answer:

x 


Answer 6:

The classInt package was created "for choosing univariate class intervals for mapping or other graphics purposes". You can just do:

dataset 

where 2 is the number of bins you want and the quantile style provides quantile breaks. There are several styles available for this function: "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", or "jenks". Check docs for more info.
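A small sketch of the quantile style on made-up data follows. The vector `x` is an illustrative assumption, not the answer's dataset, and the classInt package is assumed to be installed. The break points returned in `$brks` can be passed on to cut() to get the bin of each value.

```r
library(classInt)  # assumes the classInt package is installed

x <- 1:10  # illustrative data, not from the original answer

# Two classes with quantile-based breaks
ci <- classIntervals(x, n = 2, style = "quantile")
ci$brks  # break points: lower bound, median split, upper bound

# Turn the breaks into a factor of bin labels
bins <- cut(x, breaks = ci$brks, include.lowest = TRUE)
table(bins)
```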



Answer 7:

Here's another solution using mltools.

set.seed(1) x 

