Binning and then combining bins with minimum number of observations?

廉价感情. Submitted on 2020-01-03 04:15:10

Question


Let's say I create some data and then create bins of different sizes:

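from __future__ import division
import numpy as np
import pandas as pd

x = np.random.rand(1, 20)
# np.digitize returns an array of shape (1, 20); unpack its single row
new, = np.digitize(x, np.arange(1, x.shape[1] + 1) / 100)
new_series = pd.Series(new)
print(new_series.value_counts())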

This prints:

20 17
16 1
4  1
2  1
dtype: int64

If I set a minimum threshold of 2 observations per bin, I want to transform the underlying data so that under-filled bins are combined and new_series.value_counts() becomes:

20 17
16 3
dtype: int64

Answer 1:


EDITED:

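import numpy as np

x = np.random.rand(1, 100)
bins = np.arange(1, x.shape[1] + 1) / 100

new = np.digitize(x, bins)
n = new.copy()[0]  # this will hold the result

threshold = 2  # minimum number of observations per bin

# Walk every bin index in ascending order (not only the bins occupied at the
# start), so observations pushed into a previously empty bin are checked again.
# The highest occupied bin is never moved, so nothing goes past the last bin.
for i in range(n.min(), n.max()):
    count = np.sum(n == i)
    if 0 < count < threshold:
        n[n == i] += 1  # push the under-filled bin into the next bin up

n = n.clip(0, bins.size)  # clip returns a new array, so assign the result
n = n.reshape(1, -1)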

This can move the same observations up several bins in a row, until they land in a bin that is filled sufficiently.
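
To see what this cascading does on the numbers from the question, here is a minimal sketch. The helper name push_up_small_bins is hypothetical, and it assumes the loop walks every bin index between the lowest and highest occupied bin. Note that the result is not identical to the output requested in the question: the single observations in bins 2 and 4 meet in bin 4, while the one in bin 16 travels all the way up into bin 20.

```python
import numpy as np


def push_up_small_bins(labels, threshold):
    """Hypothetical helper: move under-filled bins up one index at a time."""
    out = np.asarray(labels).copy()
    # The highest occupied bin is excluded, so nothing moves past it.
    for i in range(out.min(), out.max()):
        count = np.sum(out == i)
        if 0 < count < threshold:
            out[out == i] += 1  # merge into the next bin up
    return out


# Bin labels with the same counts as in the question: 20 -> 17, 16 -> 1, 4 -> 1, 2 -> 1
labels = np.array([2, 4, 16] + [20] * 17)
result = push_up_small_bins(labels, threshold=2)

values, counts = np.unique(result, return_counts=True)
summary = {int(v): int(c) for v, c in zip(values, counts)}
print(summary)  # {4: 2, 20: 18}
```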

It might be simpler to use np.histogram instead of np.digitize, because it gives you the counts per bin directly, so you don't need to sum them yourself.
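
A rough sketch of the np.histogram route, under some assumptions: merge_small_bins is a hypothetical helper (not part of NumPy), and the fixed seed and the 100 equal-width bins on [0, 1] are chosen only for illustration. It merges bins by dropping the edge between an under-filled bin and its right-hand neighbour, so empty bins are absorbed as well, which differs from the label-shifting loop above.

```python
import numpy as np


def merge_small_bins(counts, edges, threshold):
    """Hypothetical helper: merge under-filled bins into their right neighbour.

    Returns the merged counts and the surviving bin edges.
    """
    merged_counts = []
    merged_edges = [edges[0]]
    running = 0
    for count, right_edge in zip(counts, edges[1:]):
        running += count
        if running >= threshold:
            # Enough observations collected: close the merged bin here.
            merged_counts.append(running)
            merged_edges.append(right_edge)
            running = 0
    if merged_counts:
        # Fold any leftover at the top into the last merged bin.
        merged_counts[-1] += running
        merged_edges[-1] = edges[-1]
    else:
        # Fewer than `threshold` observations in total: keep one bin.
        merged_counts.append(running)
        merged_edges.append(edges[-1])
    return np.array(merged_counts), np.array(merged_edges)


rng = np.random.default_rng(0)      # fixed seed, an assumption for repeatability
x = rng.random(100)                 # 100 values in [0, 1)
edges = np.linspace(0.0, 1.0, 101)  # 100 equal-width bins

counts, edges = np.histogram(x, bins=edges)
merged_counts, merged_edges = merge_small_bins(counts, edges, threshold=2)
print(merged_counts.min(), len(merged_counts))
```

Passing merged_edges back to np.histogram (or np.digitize) then bins the data directly into the combined bins.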



Source: https://stackoverflow.com/questions/38591000/binning-and-then-combining-bins-with-minimum-number-of-observations
