Given a Series, I would like to efficiently compute how many observations have passed since the value last changed. Here is a simple example:
ser = pd.
Here's one NumPy approach -
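import numpy as np

def array_cumcount(a):
    # Positions where a new run starts (value differs from its predecessor)
    idx = np.flatnonzero(a[1:] != a[:-1]) + 1
    # Step of +1 everywhere; the cumulative sum then counts up within a run
    shift_arr = np.ones(a.size, dtype=int)
    shift_arr[0] = 0
    if len(idx) >= 1:
        # At each run start, replace the +1 step with 1 - (previous run length)
        # so that the running sum drops back to 0
        shift_arr[idx[0]] = -idx[0] + 1
        shift_arr[idx[1:]] = -idx[1:] + idx[:-1] + 1
    return shift_arr.cumsum()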
Sample run -
In [583]: ser = pd.Series([1.2,1.2,1.2,1.2,2,2,2,4,3,3,3,3])
In [584]: array_cumcount(ser.values)
Out[584]: array([0, 1, 2, 3, 0, 1, 2, 0, 0, 1, 2, 3])
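For checking a vectorized version like the one above, a plain loop can serve as a reference. This is only a minimal sketch: `naive_cumcount` is a hypothetical helper name, not part of NumPy or pandas, and it is written for readability rather than speed.

```python
import numpy as np

def naive_cumcount(values):
    """Hypothetical reference helper: steps since the last change, one element at a time."""
    out = np.zeros(len(values), dtype=int)
    for i in range(1, len(values)):
        # Continue the current run if the value repeats, otherwise restart at 0
        out[i] = out[i - 1] + 1 if values[i] == values[i - 1] else 0
    return out

sample = np.array([1.2, 1.2, 1.2, 1.2, 2, 2, 2, 4, 3, 3, 3, 3])
result = naive_cumcount(sample)
print(result.tolist())  # [0, 1, 2, 3, 0, 1, 2, 0, 0, 1, 2, 3]
```

The counter restarts whenever a value differs from its predecessor, which matches the sample output shown above.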
Runtime test -
In [601]: ser = pd.Series(np.random.randint(0,3,(10000)))
# @Psidom's soln
In [602]: %timeit ser.groupby(ser).cumcount()
1000 loops, best of 3: 729 µs per loop
In [603]: %timeit array_cumcount(ser.values)
10000 loops, best of 3: 85.3 µs per loop
In [604]: ser = pd.Series(np.random.randint(0,3,(1000000)))
# @Psidom's soln
In [605]: %timeit ser.groupby(ser).cumcount()
10 loops, best of 3: 30.1 ms per loop
In [606]: %timeit array_cumcount(ser.values)
100 loops, best of 3: 11.7 ms per loop
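You can use groupby.cumcount. Note that grouping by the values themselves gives the desired result only when no value reappears after a change: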
ser.groupby(ser).cumcount()
#0 0
#1 1
#2 2
#3 3
#4 0
#5 1
#6 2
#7 0
#8 0
#dtype: int64
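If a value can come back later in the Series, grouping by consecutive runs rather than by value may be closer to what the question asks for. The following is a sketch under that assumption; the sample data and the names `run_id`, `by_run` and `by_value` are made up for illustration.

```python
import pandas as pd

# Illustrative data in which the value 1 reappears after a change
ser = pd.Series([1, 1, 2, 2, 1, 1, 1])

# Label each consecutive run: the label increases by 1 whenever the value changes
run_id = ser.ne(ser.shift()).cumsum()

by_run = ser.groupby(run_id).cumcount()   # restarts at every change
by_value = ser.groupby(ser).cumcount()    # keeps counting across separate runs

print(by_run.tolist())    # [0, 1, 0, 1, 0, 1, 2]
print(by_value.tolist())  # [0, 1, 0, 1, 2, 3, 4]
```

The two results agree only when each value forms a single run, so the run-based grouping is the one to compare against when the data is not sorted into blocks.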