python: vectorized cumulative counting

Question

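I have a NumPy array and would like to count the number of occurrences of each value, but in a cumulative way: each output element should be the number of times that value has already appeared earlier in the array.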

in  = [0, 1, 0, 1, 2, 3, 0, 0, 2, 1, 1, 3, 3, 0, ...]
out = [0, 0, 1, 1, 0, 0, 2, 3, 1, 2, 3, 1, 2, 4, ...]

I'm wondering if it is best to create a (sparse) matrix with ones at col = i and row = in[i]

       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0

Then we could compute the cumsums along the rows and extract the numbers from the locations where the cumsums increment.

However, if we cumsum a sparse matrix, doesn't it become dense? Is there an efficient way of doing this?
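For reference, the matrix idea described above can be sketched in dense form as follows. This is only an illustration of the approach, not a recommendation: the function name `cumcount_onehot` is made up for this sketch, and the intermediate matrix needs memory proportional to the array length times the number of distinct values.

```python
import numpy as np

def cumcount_onehot(a):
    # Hypothetical helper: dense version of the one-hot + cumsum idea
    a = np.asarray(a)
    n = a.size
    # Map each value to a row index 0..k-1
    vals, inv = np.unique(a, return_inverse=True)
    inv = inv.ravel()
    cols = np.arange(n)
    # One row per distinct value, a 1 wherever that value occurs
    onehot = np.zeros((vals.size, n), dtype=int)
    onehot[inv, cols] = 1
    # Running count per row, read back at each element's own row/column
    running = onehot.cumsum(axis=1)
    return running[inv, cols] - 1
```

The subtraction of 1 turns "occurrences up to and including this position" into the zero-based count shown in the example output.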

Answer 1:

Here's one vectorized approach using sorting -

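import numpy as np

def cumcount(a):
    # Accept any 1-D array-like and store its length
    a = np.asarray(a)
    n = len(a)
    if n == 0:
        return np.empty(0, dtype=int)

    # Get sorted indices (use later on too) and store the sorted array.
    # The sort must be stable so that equal values keep their original
    # order; the default sort kind does not guarantee this.
    sidx = a.argsort(kind='stable')
    b = a[sidx]

    # Mask of shifts/groups: True where the sorted value changes
    m = b[1:] != b[:-1]

    # Get indices of those shifts (last position of every group but the final one)
    idx = np.flatnonzero(m)

    # ID array of increments: +1 inside a group, and a negative reset at each
    # group start so that the cumulative sum restarts from 0
    id_arr = np.ones(n, dtype=int)
    id_arr[0] = 0
    if len(idx) > 0:  # skip when the array holds a single distinct value
        id_arr[idx[1:]+1] = -np.diff(idx)+1
        id_arr[idx[0]+1] = -idx[0]
    c = id_arr.cumsum()

    # Finally re-arrange those cumulative values back to original order
    out = np.empty(n, dtype=int)
    out[sidx] = c
    return out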
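
The reset values written into the ID array can be hard to follow, so here is a small trace of the intermediate arrays on an already-sorted input. It is a sketch for illustration only; the input `b` is made up and is not taken from the answer.

```python
import numpy as np

b = np.array([0, 0, 0, 1, 1, 2])        # already-sorted values, three groups

# Last index of every group except the final one
idx = np.flatnonzero(b[1:] != b[:-1])   # positions 2 and 4

# Start with +1 everywhere, then overwrite each group start with a reset
id_arr = np.ones(len(b), dtype=int)
id_arr[0] = 0
id_arr[idx[1:] + 1] = -np.diff(idx) + 1  # later group starts: minus (previous group size - 1)
id_arr[idx[0] + 1] = -idx[0]             # second group start: minus (first group size - 1)

# The cumulative sum now restarts from 0 at each group boundary
c = id_arr.cumsum()
```

Each reset cancels exactly the count accumulated by the previous group, which is why the cumulative sum returns to 0 at every group start.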

Sample run -

In [66]: a
Out[66]: array([0, 1, 0, 1, 2, 3, 0, 0, 2, 1, 1, 3, 3, 0])

In [67]: cumcount(a)
Out[67]: array([0, 0, 1, 1, 0, 0, 2, 3, 1, 2, 3, 1, 2, 4])
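
To check a vectorized implementation on other inputs, a plain loop can serve as a reference. The sketch below is an assumption of this write-up rather than part of the original answer, and `cumcount_reference` is a hypothetical name; it is slow for large arrays but easy to verify by eye.

```python
from collections import defaultdict

def cumcount_reference(values):
    # Hypothetical helper: for each element, how many times its value was seen before
    seen = defaultdict(int)
    out = []
    for v in values:
        out.append(seen[v])
        seen[v] += 1
    return out
```

Comparing its output with the vectorized result on random inputs, including arrays longer than a few dozen elements, is a quick way to catch ordering mistakes within groups.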

Source: https://stackoverflow.com/questions/48690864/python-vectorized-cumulative-counting

Tags

arrays

numpy

vectorization

counting

cumsum