Numpy grouping using itertools.groupby performance

庸人自扰 2020-12-01 03:17

I have many large (>35,000,000) lists of integers that will contain duplicates. I need to get a count for each integer in a list. The following code works, but seems slow. C

10 answers
  • 2020-12-01 03:35

    You could try the following (ab)use of scipy.sparse:

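    import numpy as np
    from scipy import sparse
    
    def sparse_bincount(values):
        # One-row CSR matrix: a 1.0 at column values[i] for every input element.
        M = sparse.csr_matrix((np.ones(len(values)), values.astype(int), [0, len(values)]))
        # Merging duplicate column entries adds the ones up, giving per-value counts.
        M.sum_duplicates()
        # Pack (value, count) pairs into a structured array; u2 only holds counts up to 65535.
        index = np.empty(len(M.indices),dtype='u4,u2')
        index['f0'] = M.indices
        index['f1']= M.data
        return index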
    

    This is slower than the winning answer, perhaps because scipy currently doesn't support unsigned integer types for indices...
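
    For comparison, here is a minimal sketch of a dense counterpart to the sparse-matrix trick, using np.bincount. It is not taken from the answer above: the helper name dense_bincount is hypothetical, and the sketch assumes the values are non-negative integers whose maximum is small enough that an array of that length fits in memory.

```python
import numpy as np

def dense_bincount(values):
    """Return (unique_values, counts) for non-negative integers via np.bincount."""
    values = np.asarray(values)
    counts = np.bincount(values)        # counts[k] = number of times k occurs
    present = np.flatnonzero(counts)    # keep only the values that actually occur
    return present, counts[present]

sample = np.array([3, 1, 3, 0, 3, 1])
vals, cnts = dense_bincount(sample)
print(vals, cnts)
```

    The memory cost grows with the largest value rather than with the number of distinct values, which is the trade-off the sparse version tries to avoid.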

  • 2020-12-01 03:43

    By request, here is a Cython version of this. I make two passes through the array. The first one finds out how many unique elements there are, so that I can allocate the arrays for the unique values and counts at the appropriate size.

    import numpy as np
    cimport numpy as np
    cimport cython
    
    @cython.boundscheck(False)
    def dogroup():
        cdef unsigned long tot = 1
        cdef np.ndarray[np.uint32_t, ndim=1] values = np.array(np.random.randint(35000000,size=35000000),dtype=np.uint32)
        cdef unsigned long i, ind, lastval
        values.sort()
        # First pass: count the unique values so the output arrays can be sized exactly.
        for i in range(1,len(values)):
            if values[i] != values[i-1]:
                tot += 1
        cdef np.ndarray[np.uint32_t, ndim=1] vals = np.empty(tot,dtype=np.uint32)
        cdef np.ndarray[np.uint32_t, ndim=1] count = np.empty(tot,dtype=np.uint32)
        vals[0] = values[0]
        ind = 1
        lastval = 0
        # Second pass: record each new value and the length of the run that just ended.
        for i in range(1,len(values)):
            if values[i] != values[i-1]:
                vals[ind] = values[i]
                count[ind-1] = i - lastval
                lastval = i
                ind += 1
        # Close out the final run.
        count[ind-1] = len(values) - lastval
        return vals, count
    

    The sorting is actually taking the most time here by far. Using the values array given in my code, the sorting is taking 4.75 seconds and the actual finding of the unique values and counts takes .67 seconds. With the pure Numpy code using Paul's code (but with the same form of the values array) with the fix I suggested in a comment, finding the unique values and counts takes 1.9 seconds (sorting still takes the same amount of time of course).

    It makes sense for the sort to take most of the time, because it is O(N log N) while the counting is O(N). You can speed up the sort a little over NumPy's (which uses its own quicksort implementation, if I remember correctly), but you have to really know what you are doing. There might also be some way to speed up my Cython code a little more. Neither is probably worthwhile.
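
    The same sort-then-count idea can be sketched in pure NumPy. This is an illustration rather than the code timed above: the function name sorted_run_counts is hypothetical, and it finds the run boundaries with a vectorized comparison instead of an explicit loop.

```python
import numpy as np

def sorted_run_counts(values):
    """Sort, then measure the length of each run of equal values."""
    values = np.sort(np.asarray(values))            # the O(N log N) step
    if values.size == 0:
        return values, np.zeros(0, dtype=np.intp)
    # A new run starts at index 0 and wherever a value differs from its predecessor.
    changes = np.flatnonzero(values[1:] != values[:-1]) + 1
    starts = np.concatenate(([0], changes))
    # Run length = distance from each start to the next start (or to the end).
    counts = np.diff(np.concatenate((starts, [values.size])))
    return values[starts], counts

run_vals, run_counts = sorted_run_counts([5, 2, 5, 5, 9, 2])
print(run_vals, run_counts)
```

    Everything after the sort is a single linear pass over the data, which matches the observation that the sort dominates the running time.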

  • 2020-12-01 03:44

    Recent versions of NumPy (1.9 and later) support this directly through the return_counts argument of np.unique.

    import numpy as np
    frequency = np.unique(values, return_counts=True)
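
    With return_counts=True the call returns a tuple of two arrays, so it is usually convenient to unpack it. A small usage sketch, with sample data made up for illustration:

```python
import numpy as np

values = np.array([4, 7, 4, 1, 7, 4], dtype=np.uint32)

# uniques is sorted ascending; counts[i] is how often uniques[i] occurs.
uniques, counts = np.unique(values, return_counts=True)

for value, count in zip(uniques.tolist(), counts.tolist()):
    print(value, count)
```

    Note that np.unique sorts internally, so this is still an O(N log N) approach.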
    
  • 2020-12-01 03:46

    Sorting is Θ(N log N); I'd go for the amortized O(N) provided by Python's hash table implementation. Use defaultdict(int) to keep counts of the integers and iterate over the array once:

    import collections
    
    counts = collections.defaultdict(int)
    for v in values:
        counts[v] += 1
    

    This is theoretically faster, but unfortunately I have no way to check right now. Allocating the additional memory might actually make it slower than your solution, which is in-place.
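
    To compare this with the array-based answers, the dictionary can be turned into two parallel arrays. A minimal sketch, assuming the input may be a NumPy array; the function name dict_counts is hypothetical, and the .tolist() conversion is an addition so that the loop hashes plain Python ints instead of NumPy scalars.

```python
import collections
import numpy as np

def dict_counts(values):
    """Count with a hash table, then return sorted (unique_values, counts)."""
    counts = collections.defaultdict(int)
    for v in np.asarray(values).tolist():   # plain Python ints
        counts[v] += 1
    keys = sorted(counts)                   # sorts only the distinct values
    return np.array(keys), np.array([counts[k] for k in keys])

d_vals, d_counts = dict_counts(np.array([3, 1, 3, 0, 3, 1]))
print(d_vals, d_counts)
```

    The final sort touches only the distinct values, so it is cheap when duplicates are common. It can be dropped if the order of the output does not matter.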

    Edit: If you need to save memory try radix sort, which is much faster on integers than quicksort (which I believe is what numpy uses).
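
    As an illustration of the radix sort idea, here is a least-significant-digit sketch for 32-bit unsigned integers. It is a sketch under assumptions, not a tuned implementation: the name radix_sort_u32 is hypothetical, each pass orders the data by one byte using NumPy's stable argsort, and no claim is made that it beats np.sort in practice.

```python
import numpy as np

def radix_sort_u32(a):
    """LSD radix sort: four stable passes, one per byte, lowest byte first."""
    a = np.asarray(a, dtype=np.uint32).copy()
    for shift in (0, 8, 16, 24):
        digit = ((a >> shift) & 0xFF).astype(np.uint8)   # current byte of each key
        # The pass must be stable so the order from earlier bytes is preserved.
        a = a[np.argsort(digit, kind="stable")]
    return a

rng = np.random.default_rng(0)
data = rng.integers(0, 35_000_000, size=1000, dtype=np.uint32)
sorted_data = radix_sort_u32(data)
print(sorted_data[:5])
```

    Each pass is linear in N when the per-byte stable sort is itself a counting or radix sort, which NumPy's documentation describes for integer types of 16 bits or less; treat that detail as an assumption about the installed NumPy version.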
