Question:
I'm having trouble finding a function in R that performs equal-frequency discretization. I stumbled on the 'infotheo' package, but after some testing I found that the algorithm is broken. 'dprep' seems to no longer be supported on CRAN.
EDIT:
For clarity, I do not need to separate the values between the bins. I really want equal frequency; it doesn't matter if one value ends up in two bins. E.g.:
c(1,3,2,1,2,2)
should give a bin c(1,1,2)
and one c(2,2,3)
Answer 1:
EDIT: given your real goal, why don't you just do (corrected):
EqualFreq2
This returns a vector indicating which bin each value falls into. But as some values might be present in both bins, you cannot define the bin limits. You can, however, do:
x
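A minimal sketch of this idea, assuming the goal is simply to sort the values and deal them out into n groups of (nearly) equal size. The name `equal_freq_ids` and the rule of handing leftover observations to the first bins are assumptions made for illustration, not the answer's original code.

```r
# Assign each value a bin index so that bin sizes differ by at most one.
# Tied values may be split across neighbouring bins, as the question allows.
equal_freq_ids <- function(x, n) {
  nx <- length(x)
  stopifnot(n >= 1, n <= nx)
  sizes <- rep(nx %/% n, n)   # base size of every bin
  extra <- nx %% n            # leftover observations (assumed to go to the first bins)
  if (extra > 0) sizes[seq_len(extra)] <- sizes[seq_len(extra)] + 1
  ids <- integer(nx)
  ids[order(x)] <- rep(seq_len(n), times = sizes)  # smallest values get bin 1
  ids
}

x <- c(1, 3, 2, 1, 2, 2)
bins <- equal_freq_ids(x, 2)
split(x, bins)   # values grouped by bin; the value 2 appears in both bins
```

Because the result is a vector of bin indices rather than a factor with interval labels, it sidesteps the problem of overlapping bin limits.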
Original answer:
You can easily just use cut() for this:
EqualFreq 0 stop("n is too large.") cut(x,breaks,include.lowest=include.lowest,...) }
Which gives :
set.seed(12345) x
As you see, for discrete data an optimal equal binning is rather impossible in most cases, but this method gives you the best possible binning available.
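A sketch of the cut()-based approach, assuming the breaks are picked at evenly spaced positions in the sorted data. The function name `equal_freq_cut` and the sample data are illustrative assumptions; only the error message and the final cut() call are taken from the text above.

```r
# Equal-frequency binning via cut(): breaks are read off the sorted data.
equal_freq_cut <- function(x, n, include.lowest = TRUE, ...) {
  nx <- length(x)
  # positions of the bin boundaries within the sorted vector
  id <- round(c(1, seq_len(n - 1) * (nx / n), nx))
  breaks <- sort(x)[id]
  # tied data can produce repeated breaks, which cut() cannot handle
  if (anyDuplicated(breaks) > 0) stop("n is too large.")
  cut(x, breaks, include.lowest = include.lowest, ...)
}

set.seed(12345)
x <- rnorm(50)                 # continuous sample data (assumed)
table(equal_freq_cut(x, 5))
```

With continuous data the bins come out the same size; with heavily tied data the function either stops or returns bins of unequal size, which is the limitation described above.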
Answer 2:
This sort of thing is also quite easily solved by using (abusing?) the conditioning plot infrastructure from lattice, in particular the function co.intervals():
cutEqual
Which reproduces @Joris' excellent answer:
> set.seed(12345)
> x
> table(cutEqual(x, 5))

 [-2.38,-0.885] (-0.885,-0.115]  (-0.115,0.587]   (0.587,0.938]     (0.938,2.2]
             10              10              10              10              10
> y
> table(cutEqual(y, 5))

 [0.5,3.5]  (3.5,5.5]  (5.5,6.5]  (6.5,7.5] (7.5,11.5]
        10         13         11          6         10
In the latter, discrete, case the breaks are different although they have the same effect; the same observations are in the same bins.
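A hedged sketch of how co.intervals() can be turned into a binning function. It calls the function as `graphics::co.intervals()`, which ships with R; that namespace, the name `cut_equal`, and the sample data are assumptions made for illustration rather than the answer's exact code.

```r
# Equal-count bins using the interval logic behind conditioning plots.
cut_equal <- function(x, n, include.lowest = TRUE, ...) {
  # n non-overlapping intervals: column 1 = lower bounds, column 2 = upper bounds
  iv <- graphics::co.intervals(x, number = n, overlap = 0)
  breaks <- c(iv[1, 1], iv[, 2])   # lowest lower bound plus every upper bound
  cut(x, breaks, include.lowest = include.lowest, ...)
}

set.seed(12345)
x <- rnorm(50)                     # continuous sample data (assumed)
table(cut_equal(x, 5))
```

co.intervals() nudges each interval end outward by a small amount, so the breaks fall between observations rather than on them. This is why the discrete case shows half-integer limits such as 3.5.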
Answer 3:
How about?
a
table(Hmisc::cut2(a, m = 10))

[-2.2020,-0.7710) [-0.7710,-0.2352) [-0.2352, 0.0997) [ 0.0997, 0.9775)
               10                10                10                10
[ 0.9775, 2.5677]
               10
Answer 4:
Here is a function that handles the error "'breaks' are not unique" and automatically selects the n_bins value closest to the one you set.
equal_freq 1) { n_bins=n_bins-1 res=tryCatch(cut_number(var, n = n_bins), error=function(e) {return (e)}) } if(n_bins_orig != n_bins) warning(sprintf("It's not possible to calculate with n_bins=%s, setting n_bins in: %s.", n_bins_orig, n_bins)) return(res) }
Example:
equal_freq(mtcars$carb, 10)
Which retrieves the binned variable and the following warning:
It's not possible to calculate with n_bins=10, setting n_bins in: 5.
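A sketch of the retry logic, assuming `cut_number()` comes from ggplot2 and that the function lowers n_bins one step at a time until the call succeeds. The loop structure is an assumption; the warning text is copied from the answer.

```r
library(ggplot2)  # provides cut_number() (assumed source of the function)

# Try n_bins equal-frequency bins; on failure, step down until cut_number() succeeds.
equal_freq <- function(var, n_bins) {
  n_bins_orig <- n_bins
  res <- tryCatch(cut_number(var, n = n_bins), error = function(e) e)
  while (inherits(res, "error") && n_bins > 1) {
    n_bins <- n_bins - 1
    res <- tryCatch(cut_number(var, n = n_bins), error = function(e) e)
  }
  if (inherits(res, "error")) stop(res)   # even a single bin failed
  if (n_bins_orig != n_bins) {
    warning(sprintf("It's not possible to calculate with n_bins=%s, setting n_bins in: %s.",
                    n_bins_orig, n_bins))
  }
  res
}

binned <- equal_freq(mtcars$carb, 10)   # warns when fewer bins had to be used
```

Catching the error and returning the condition object keeps the loop simple: the result is either a factor or an error, and `inherits()` tells the two apart.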
Answer 5:
Here is a one-liner solution inspired by @Joris' answer:
x
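One way such a one-liner can look, assuming it combines cut() with quantile(). The sample data, the choice of five bins, and the variable names are illustrative assumptions.

```r
set.seed(1)
x <- rnorm(50)   # sample data, assumed for illustration

# Quintile breaks from quantile(), fed straight into cut()
x_binned <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.2)), include.lowest = TRUE)
table(x_binned)
```

As with any cut()-based method, this fails with a "'breaks' are not unique" error when ties make two quantiles coincide.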
Answer 6:
The classInt package was created "for choosing univariate class intervals for mapping or other graphics purposes". You can just do:
dataset
where 2 is the number of bins you want and the quantile style provides quantile breaks. There are several styles available for this function: "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", or "jenks". Check the docs for more info.
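A short sketch of the classInt route, assuming a data frame with one numeric column. The sample data and the step that turns the breaks into a factor with cut() are assumptions added for illustration.

```r
library(classInt)

set.seed(1)
dataset <- data.frame(x = runif(100))   # sample data, assumed for illustration

# Two classes with quantile breaks; the computed breaks are stored in $brks
ci <- classIntervals(dataset$x, n = 2, style = "quantile")
dataset$bin <- cut(dataset$x, breaks = ci$brks, include.lowest = TRUE)
table(dataset$bin)
```

classIntervals() only computes the breaks, so a second step such as cut() is needed to label each observation with its bin.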
Answer 7:
Here's another solution using mltools.
set.seed(1) x
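A hedged sketch of what an mltools call can look like, assuming the package's `bin_data()` function with `binType = "quantile"`. The sample data and the number of bins are illustrative assumptions, so check the package documentation for the exact arguments.

```r
set.seed(1)
x <- round(rnorm(20), 2)   # sample data, assumed for illustration

# binType = "quantile" asks for bins holding roughly equal numbers of observations
x_binned <- mltools::bin_data(x, bins = 5, binType = "quantile")
table(x_binned)
```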