Efficiently get indices of histogram bins in Python

Asked by 清歌不尽 on 2020-12-13 19:12 · 5 answers · 1183 views

Short Question

I have a large 10000x10000-element image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation…

5 Answers
  • 2020-12-13 19:54

    Pandas has very fast grouping code (I think it's written in C), so if you don't mind loading the library, you could do this:

    import pandas as pd
    
    pdata = pd.DataFrame({'vals': vals, 'ind': ind})
    resultsp = pdata.groupby('ind').sum().values
    

    or, more generally:

    pdata = pd.DataFrame({'vals': vals, 'ind': ind})
    resultsp = pdata.groupby('ind').agg(func).values
    

    Note that the latter is slower for standard aggregation functions (sum, mean, etc.), because passing a Python callable to agg bypasses pandas' optimized fast paths.
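
    For reference, here is a minimal self-contained setup (the vals, ind, and nbins names follow the question; the toy sizes are mine):

    import numpy as np
    import pandas as pd
    
    # Stand-ins for the question's data: values and their bin indices.
    vals = np.random.random(10**6)
    nbins = 100
    bins = np.linspace(0, 1, nbins + 1)
    ind = np.digitize(vals, bins)
    
    pdata = pd.DataFrame({'vals': vals, 'ind': ind})
    resultsp = pdata.groupby('ind').sum().values  # one row of sums per occupied bin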

  • 2020-12-13 20:10

    One efficient solution is using the numpy_indexed package (disclaimer: I am its author):

    import numpy_indexed as npi
    npi.group_by(ind).split(vals)
    
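    This returns a list of arrays, one per occupied bin. The package also has grouped reductions built in; a sketch (API as I recall it from the numpy_indexed docs, toy data mine):

    import numpy as np
    import numpy_indexed as npi
    
    vals = np.random.random(10**6)
    ind = np.digitize(vals, np.linspace(0, 1, 101))
    
    groups = npi.group_by(ind).split(vals)    # list of arrays, one per bin
    keys, sums = npi.group_by(ind).sum(vals)  # grouped reduction: (unique keys, sums)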
  • 2020-12-13 20:13

    You can halve the computation time by sorting the array first and then using np.searchsorted.

    import numpy as np
    
    vals = np.random.random(10**8)  # the size argument must be an int, not 1e8
    vals.sort()
    
    nbins = 100
    bins = np.linspace(0, 1, nbins + 1)
    ind = np.digitize(vals, bins)  # ind is sorted because vals is
    
    # digitize labels the bins 1..nbins, so iterate over that full range
    results = [func(vals[np.searchsorted(ind, j, side='left'):
                         np.searchsorted(ind, j, side='right')])
               for j in range(1, nbins + 1)]
    

    Using 1e8 elements as my test case, the computation drops from 34 seconds to about 17.
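
    For reductions that map onto a ufunc (sum, min, max, and so on), the same sorted layout also works with np.add.reduceat, which drops the per-bin Python loop entirely. This is a sketch of an alternative technique, not part of the original answer:

    import numpy as np
    
    vals = np.sort(np.random.random(10**6))
    nbins = 100
    ind = np.digitize(vals, np.linspace(0, 1, nbins + 1))
    
    # Start index of each bin within the sorted array.
    starts = np.searchsorted(ind, np.arange(1, nbins + 1), side='left')
    sums = np.add.reduceat(vals, starts)
    # Caveat: empty bins break this pattern (equal consecutive starts yield
    # vals[start] instead of 0), so guard against them if they can occur.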

  • 2020-12-13 20:15

    I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure, but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic, but it can be orders of magnitude faster.

    import numpy as np
    from scipy.sparse import csr_matrix
    
    def binned_statistic(x, values, func, nbins, range):
        '''The usage is nearly the same as scipy.stats.binned_statistic'''
    
        N = len(values)
        r0, r1 = range
    
        # Compute bin indices directly, exploiting the equal-width bins.
        digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
        # Building a CSR matrix groups the values by row (= bin) as a side
        # effect: S.data holds the values ordered by bin, and S.indptr holds
        # the boundaries between bins.
        S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
    
        return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

    I avoided np.digitize because it doesn't exploit the fact that all bins are equal width and is therefore slower; the direct index computation used instead may not handle all edge cases perfectly (e.g. a value exactly at the upper edge r1 would produce an out-of-range bin index).
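
    A quick usage sketch, reusing the binned_statistic defined above (the toy data and the cross-check against scipy.stats.binned_statistic are mine):

    import numpy as np
    from scipy import stats
    
    x = np.random.random(10**6)
    values = np.random.random(10**6)
    
    fast = binned_statistic(x, values, np.sum, 100, (0.0, 1.0))
    ref, _, _ = stats.binned_statistic(x, values, statistic='sum',
                                       bins=100, range=(0.0, 1.0))
    assert np.allclose(fast, ref)  # agrees on this data, edge cases aside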

  • 2020-12-13 20:16

    I assume that the binning, done in the question's example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.

    import numpy as np
    import matplotlib.pyplot as plt
    
    vals = np.random.random(10**4)  # the size argument must be an int
    nbins = 100
    bins = np.linspace(0, 1, nbins + 1)
    ind = np.digitize(vals, bins)
    
    new_order = np.argsort(ind)
    ind = ind[new_order]
    ordered_vals = vals[new_order]
    # slower way of calculating first_hit (first version of this post):
    # _, first_hit = np.unique(ind, return_index=True)
    # faster way: where each bin starts in the sorted array; the extra
    # nbins+1 entry makes first_hit[j]:first_hit[j+1] valid for the last bin
    first_hit = np.searchsorted(ind, np.arange(1, nbins + 2))
    
    # example of using the data:
    for j in range(nbins):
        # I am using a plotting function for your func, to show that they cluster
        plt.plot(ordered_vals[first_hit[j]:first_hit[j + 1]], 'o')
    

    The figure shows that the bins are actually clusters, as expected:

    [Figure: per-bin scatter plot of ordered_vals, showing the values clustering by bin]
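
    To apply an actual reduction instead of plotting, the same boundaries drive a plain list comprehension (a sketch continuing the code above; np.sum stands in for the question's func):

    # per-bin sums; empty bins contribute 0
    results = [np.sum(ordered_vals[first_hit[j]:first_hit[j + 1]])
               for j in range(nbins)]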
