Efficiently get indices of histogram bins in Python

Short Question

I have a large image of 10000x10000 elements, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation

5 Answers

陌清茗

2020-12-13 19:54
Pandas has very fast grouping code (I think it's written in C), so if you don't mind loading the library you could use it:
```
import pandas as pd

pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').sum().values
```
or, more generally:
```
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').agg(func).values
```
Note that the latter is slower than the built-in methods for standard aggregation functions (sum, mean, etc.).
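As a small self-contained illustration of the grouping approach above, the following sketch groups six values into three bins. The toy arrays and the max-minus-min aggregation are made up for the example; it also reads the row positions of each bin from `GroupBy.indices`, which may be useful if you need the indices rather than an aggregate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six values and the bin label of each one.
vals = np.array([0.1, 0.5, 0.2, 0.9, 0.4, 0.7])
ind = np.array([1, 2, 1, 3, 2, 3])

pdata = pd.DataFrame({'vals': vals, 'ind': ind})
grouped = pdata.groupby('ind')

# Built-in aggregation: one sum per bin, ordered by bin label.
sums = grouped['vals'].sum().to_numpy()

# Arbitrary aggregation: any function of a group's values (here max - min).
spreads = grouped['vals'].agg(lambda g: g.max() - g.min()).to_numpy()

# Row positions of the elements in each bin, as a dict keyed by bin label.
positions = grouped.indices
```

Here `positions[1]` holds the positions of the rows labelled 1, so `vals[positions[1]]` recovers that bin's values.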

忘掉有多难

2020-12-13 20:10
One efficient solution is using the numpy_indexed package (disclaimer: I am its author):
```
import numpy_indexed as npi
npi.group_by(ind).split(vals)
```
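For readers who prefer not to add a dependency, the same kind of grouping can be sketched in plain NumPy. This only illustrates the idea (sort by bin label, then split at the label boundaries) and is not a description of how numpy_indexed is implemented; the helper name `split_by_label` and the toy data are assumptions.

```python
import numpy as np

def split_by_label(ind, vals):
    """Return (labels, groups): the distinct labels and the values of each."""
    # A stable sort keeps the original order of the elements within a bin.
    order = np.argsort(ind, kind='stable')
    sorted_ind = ind[order]
    # First position of every distinct label in the sorted label array.
    labels, starts = np.unique(sorted_ind, return_index=True)
    groups = np.split(vals[order], starts[1:])
    return labels, groups

ind = np.array([2, 1, 2, 3, 1])
vals = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
labels, groups = split_by_label(ind, vals)
```

Labels that never occur in `ind` produce no group at all, so `labels` should be consulted to see which bin each group belongs to.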

醉梦人生

2020-12-13 20:13

You can roughly halve the computation time by sorting the array first and then locating each bin's slice with np.searchsorted:

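```
import numpy as np

vals = np.random.random(int(1e8))  # the size must be an int, not the float 1e8
vals.sort()

nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)  # non-decreasing, because vals is sorted

# func is your own per-bin calculation; digitize labels the bins 1..nbins
results = [func(vals[np.searchsorted(ind,j,side='left'):
                     np.searchsorted(ind,j,side='right')])
           for j in range(1,nbins+1)]
```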

With 1e8 elements as my test case, the computation time drops from 34 seconds to about 17.
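The slicing trick relies on ind being non-decreasing once vals is sorted, so that every bin occupies one contiguous slice. A quick way to convince yourself on a small array is to compare the slices with the plain boolean-mask selection. In this sketch `np.sum` stands in for func, and the array size, bin count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
vals = np.sort(rng.random(1000))          # sorted values in [0, 1)
nbins = 10
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)             # non-decreasing because vals is sorted

# Contiguous slice per bin, located by binary search.
by_slice = [np.sum(vals[np.searchsorted(ind, j, side='left'):
                        np.searchsorted(ind, j, side='right')])
            for j in range(1, nbins + 1)]

# Reference: a boolean mask per bin, which scans the whole array each time.
by_mask = [np.sum(vals[ind == j]) for j in range(1, nbins + 1)]

assert np.allclose(by_slice, by_mask)
```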

独厮守ぢ

2020-12-13 20:15
I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure, but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic, but can be orders of magnitude faster:
```
import numpy as np
from scipy.sparse import csr_matrix

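def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''

    N = len(values)
    r0, r1 = range  # (lower, upper) limits; the name mirrors the scipy argument

    # equal-width bins, so the bin index is plain arithmetic (no np.digitize)
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    # row = bin index, column = sample index: CSR storage groups the data by row
    S = csr_matrix((values, (digitized, np.arange(N))), shape=(nbins, N))

    # indptr marks where each row (bin) starts inside S.data
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]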
```
I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
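To make the edge-case remark concrete: with the arithmetic bin index, a value exactly equal to the upper limit lands in bin nbins, one past the last valid row. The sketch below shows one possible way to handle that, by clipping into the last bin, which matches the convention of `np.histogram` (its last bin is closed on the right). The helper name and the clipping policy are assumptions for illustration, not part of the function above.

```python
import numpy as np

def equal_width_index(x, nbins, lo, hi):
    """Zero-based bin index for equal-width bins on [lo, hi] (hypothetical helper)."""
    idx = (nbins / (hi - lo) * (x - lo)).astype(int)
    # x == hi gives idx == nbins; fold it into the last bin.
    # Caution: clipping also silently folds out-of-range values into the end bins.
    return np.clip(idx, 0, nbins - 1)

x = np.array([0.0, 0.05, 0.31, 0.99, 1.0])
idx = equal_width_index(x, nbins=10, lo=0.0, hi=1.0)
```

If out-of-range inputs are possible, it may be safer to mask them out before binning than to rely on the clip.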

盖世英雄少女心

2020-12-13 20:16

I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.

```
import numpy as np
import matplotlib.pyplot as plt

vals = np.random.random(int(1e4))  # the size must be an int
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

new_order = np.argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post)
# _,first_hit = np.unique(ind,return_index=True)
# faster way: start of every bin 1..nbins, plus one closing boundary
first_hit = np.searchsorted(ind,np.arange(1,nbins+2))

#example of using the data:
for j in range(nbins):
    #I am using a plotting function for your f, to show that they cluster
    plt.plot(ordered_vals[first_hit[j]:first_hit[j+1]],'o')
plt.show()
```

The figure shows that the bins are actually clusters, as expected.
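Since the sort permutation is already available, the same boundaries also give the original positions of the elements in each bin, which is what the question title asks for. A minimal sketch, with a hypothetical helper name and toy input; empty bins come out as empty arrays:

```python
import numpy as np

def indices_per_bin(ind, nbins):
    """List of index arrays: entry j-1 holds the positions where ind == j."""
    order = np.argsort(ind, kind='stable')
    # boundaries[k] = first sorted position whose label is >= k+1,
    # evaluated for labels 1..nbins+1 so the last bin has a closing boundary.
    boundaries = np.searchsorted(ind[order], np.arange(1, nbins + 2))
    return [order[boundaries[j]:boundaries[j + 1]] for j in range(nbins)]

ind = np.array([3, 1, 3, 4, 1])        # bin labels, as returned by np.digitize
groups = indices_per_bin(ind, nbins=4)
```

The sort is done once, so repeated passes of an iterative calculation can reuse `groups` without touching the full array again.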