Python: how to make an histogram with equally *sized* bins

情到浓时终转凉″ 提交于 2019-12-03 13:03:47

Using your example case (bins of 2 points, 6 total data points):

from scipy import stats
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
>> array([1. , 1.24666667, 2.05333333, 2.12])

I would like to mention also the existence of pandas.qcut, which does equi-populated binning in quite an efficient way. In your case it would work something like

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
# parameter q specifies the number of bins
qc = pd.qcut(data, q=3, precision=1)

# bin definition
bins  = qc.categories
print(bins)
>> Index(['[1, 1.3]', '(1.3, 2.03]', '(2.03, 2.1]'], dtype='object')

# bin corresponding to each point in data
codes = qc.codes
print(codes)
>> array([0, 0, 1, 1, 2, 2], dtype=int8)

Update for skewed distributions :

I came across the same problem as @astabada, wanting to create bins each containing an equal number of samples. When applying the solution proposed @aganders3, I found that it didn't work particularly well for skewed distributions. In the case of skewed data (for example something with a whole lot of zeros), stats.mstats.mquantiles for a predefined number of quantiles will not guarantee an equal number of samples in each bin. You will get bin edges that look like this :

[0. 0. 4. 9.]

In which case the first bin will be empty.

In order to deal with skewed cases, I created a function that calls stats.mstats.mquantiles and then dynamically modifies the number of bins if samples are not equal within a certain tolerance (30% of the smallest sample size in the example code). If samples are not equal between bins, the code reduces the number of equally-spaced quantiles by 1 and calls stats.mstats.mquantiles again until sample sizes are equal or only one bin exists.

I hard coded the tolerance in the example, but this could be modified to a keyword argument if desired.

I also prefer giving the number of equally spaced quantiles as an argument to my function instead of giving user defined quantiles to stats.mstats.mquantiles in order to reduce accidental errors (i.e. something like [0., 0.25, 0.7, 1.]).

Here's the code :

import numpy as np 
from scipy import stats

def equibins(dat, binnum, **kwargs):
    numin = binnum
    while numin>1.:
        qtls = np.linspace(0.,1.0,num=numin,endpoint=False)
        ebins =stats.mstats.mquantiles(dat,qtls,alphap=kwargs['alpha'],betap=kwargs['beta'])
        allhist, allbin   = np.histogram(dat, bins = ebins)
        if (np.unique(ebins).shape!=ebins.shape or tolerence(allhist,0.3)==False) and numin>2:
            numin= numin-1
            del qtls, ebins
        else:
            numin=0
    return ebins

def tolerence(narray, percent):
    if percent>1.0:
        per = percent/100.
    else:
        per = percent
    lev_tol  = per*narray.min()
    tolerate = np.all(narray[1:]-narray[0]<lev_tol)
    return tolerate

Just sort the data, and divide it into fixed bins by length! Obviously you can never divide into exactly equally populated bins, if the number of samples does not divide exactly by the number of bins.

import math
import numpy as np
data = np.array([2,3,5,6,8,5,5,6,3,2,3,7,8,9,8,6,6,8,9,9,0,7,5,3,3,4,5,6,7])
data_sorted = np.sort(data)
nbins = 3
step = math.ceil(len(data_sorted)//nbins+1)
binned_data = []
for i in range(0,len(data_sorted),step):
    binned_data.append(data_sorted[i:i+step])
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!