I'm looking for a fast equivalent of MATLAB's accumarray in numpy. accumarray accumulates the elements of an array that belong to the same index. An example is sketched below.
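For concreteness, here is a small illustration of the behaviour I'm after (the same toy data reappears in the pandas answer further down; a holds the values and accmap the group indices):

import numpy as np

a = np.arange(1, 11)                                # values 1..10
accmap = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])   # group index for each value

# desired result: the sum of a within each group
# group 0 -> 1+3+4+5 = 13, group 1 -> 2+6+7+10 = 25, group 2 -> 8+9 = 17
# i.e. accumarray(a, accmap) should return [13, 25, 17]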
Not as good as the accepted answer, but:
[np.sum([a[x] for x in y]) for y in [list(np.where(accmap==z)) for z in np.unique(accmap).tolist()]]
This takes 108 us per loop (100000 loops, best of 3).
The accepted answer, np.bincount(accmap, weights=a), takes 2.05 us per loop (100000 loops, best of 3).
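For reference, a minimal sketch of the bincount approach being compared against (assuming accmap holds non-negative integer group labels):

import numpy as np

a = np.arange(1, 11)
accmap = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])

# one bin per group label; weights=a sums the values that fall into each bin
sums = np.bincount(accmap, weights=a)   # array([13., 25., 17.])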
How about the following:
import numpy
def accumarray(a, accmap):
    # sort so that entries with the same group label become contiguous
    ordered_indices = numpy.argsort(accmap)
    ordered_accmap = accmap[ordered_indices]
    # position of the first element of each group in the sorted array
    _, sum_indices = numpy.unique(ordered_accmap, return_index=True)
    # running sum just before each group starts; for the first group the
    # index wraps around to -1, i.e. the grand total
    cumulative_sum = numpy.cumsum(a[ordered_indices])[sum_indices - 1]
    result = numpy.empty(len(sum_indices), dtype=a.dtype)
    # running sum at the end of each group ...
    result[:-1] = cumulative_sum[1:]
    result[-1] = cumulative_sum[0]
    # ... minus the running sum at the end of the previous group = per-group sum
    result[1:] = result[1:] - cumulative_sum[1:]
    return result
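A quick check with the toy data from the question (not part of the original answer, just a usage sketch):

import numpy

a = numpy.arange(1, 11)
accmap = numpy.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])
print(accumarray(a, accmap))   # [13 25 17]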
Late to the party, but...
As @Jamie says, for the case of summing, np.bincount is fast and simple. However, in the more general case, for other ufuncs such as maximum, you can use the np.ufunc.at method.
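A minimal sketch of what that looks like for a per-group maximum (variable names are mine, not from the answer):

import numpy as np

vals = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
idx = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])

out = np.full(idx.max() + 1, -np.inf)   # identity element for maximum
np.maximum.at(out, idx, vals)           # unbuffered in-place maximum per index
# out -> array([ 5., 10.,  9.])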
I've put together a gist [see link below instead] which encapsulates this in a Matlab-like interface. It also takes advantage of the repeated-indexing rules to provide 'last' and 'first' functions, and unlike Matlab, 'mean' is sensibly optimized (calling accumarray with @mean in Matlab is really slow because it runs a non-builtin function for every single group, which is stupid).
Be warned that I haven't particularly tested the gist, but I will hopefully update it in future with extra features and bugfixes.
Update May/June 2015: I have reworked my implementation; it is now part of ml31415/numpy-groupies and available on PyPI (pip install numpy-groupies). Benchmarks are as follows (see the github repo for up-to-date values)...
function   pure-py    np-grouploop  np-ufuncat  np-optimised  pandas    ratio
std        1737.8ms   171.8ms       no-impl     7.0ms         no-impl   247.1 : 24.4 :  -   : 1.0 :  -
all        1280.8ms   62.2ms        41.8ms      6.6ms         550.7ms   193.5 :  9.4 : 6.3  : 1.0 : 83.2
min        1358.7ms   59.6ms        42.6ms      42.7ms        24.5ms     55.4 :  2.4 : 1.7  : 1.7 :  1.0
max        1538.3ms   55.9ms        38.8ms      37.5ms        18.8ms     81.9 :  3.0 : 2.1  : 2.0 :  1.0
sum        1532.8ms   62.6ms        40.6ms      1.9ms         20.4ms    808.5 : 33.0 : 21.4 : 1.0 : 10.7
var        1756.8ms   146.2ms       no-impl     6.3ms         no-impl   279.1 : 23.2 :  -   : 1.0 :  -
prod       1448.8ms   55.2ms        39.9ms      38.7ms        20.2ms     71.7 :  2.7 : 2.0  : 1.9 :  1.0
any        1399.5ms   69.1ms        41.1ms      5.7ms         558.8ms   246.2 : 12.2 : 7.2  : 1.0 : 98.3
mean       1321.3ms   88.3ms        no-impl     4.0ms         20.9ms    327.6 : 21.9 :  -   : 1.0 :  5.2
Python 2.7.9, Numpy 1.9.2, Win7 Core i7.
Here we are using 100,000 indices uniformly picked from [0, 1000). Specifically, about 25% of the values are 0 (for use with bool operations); the remainder are uniformly distributed on [-50, 25). Timings are shown for 10 repeats.
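A rough sketch of how such test data could be generated (my reconstruction of the setup described above, not the benchmark's actual code):

import numpy as np

rng = np.random.RandomState(0)
n = 100000
idx = rng.randint(0, 1000, size=n)      # group indices uniform on [0, 1000)
vals = rng.uniform(-50, 25, size=n)     # values uniform on [-50, 25)
vals[rng.rand(n) < 0.25] = 0            # roughly 25% exact zeros for bool ops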
pure-py uses itertools.groupby.
np-grouploop uses numpy to sort values based on idx, then uses split to create separate arrays, and then loops over these arrays, running the relevant numpy function for each array (a sketch of this approach is given after this list).
np-ufuncat uses the numpy ufunc.at method, which is slower than it ought to be, as discussed in an issue I created on numpy's github repo.
np-optimised uses custom numpy indexing/other tricks to beat the above two implementations (except for min, max, and prod, which rely on ufunc.at).
pandas does pd.DataFrame({'idx':idx, 'vals':vals}).groupby('idx').sum() etc.
Note that some of the no-impls may be unwarranted, but I haven't bothered to get them working yet.
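For illustration, a rough sketch of the np-grouploop idea described above (my paraphrase, not the benchmarked code):

import numpy as np

def grouploop_sum(idx, vals):
    order = np.argsort(idx)                        # sort values by group index
    sorted_idx, sorted_vals = idx[order], vals[order]
    # positions where a new group starts, used to split into per-group arrays
    boundaries = np.flatnonzero(np.diff(sorted_idx)) + 1
    groups = np.split(sorted_vals, boundaries)
    return np.array([np.sum(g) for g in groups])   # loop, applying the numpy func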
As explained on github, accumarray now supports nan-prefixed functions (e.g. nansum) as well as sort, rsort, and array. It also works with multidimensional indexing.
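For orientation, this is roughly how the packaged version is called (the package's aggregate function; check the numpy-groupies README for the current interface):

import numpy as np
import numpy_groupies as npg

idx = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])
vals = np.arange(1, 11)

npg.aggregate(idx, vals, func='sum')      # array([13, 25, 17])
npg.aggregate(idx, vals, func='nanmax')   # per-group max, ignoring NaNs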
I've written an accumarray implementation with scipy.weave
and uploaded it at github: https://github.com/ml31415/numpy-groupies
It depends on what exactly you are trying to do, but numpy's unique has a bunch of optional outputs that you can use to accumulate. If your array has several identical values, you can count how many of each there are by setting the return_counts option to True. In some simple applications, this is all you need.
numpy.unique(ar, return_index=False, return_inverse=False, return_counts=True, axis=None)
You can also set return_index (or return_inverse) to True and use the resulting indices to accumulate a different array.
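A small sketch of that idea using return_inverse together with np.bincount (my example, not from the answer):

import numpy as np

accmap = np.array([0, 1, 0, 0, 0, 1, 1, 2, 2, 1])
a = np.arange(1, 11)

keys, inverse, counts = np.unique(accmap, return_inverse=True, return_counts=True)
sums = np.bincount(inverse, weights=a)   # array([13., 25., 17.])
means = sums / counts                    # per-group mean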
You can do this with a pandas DataFrame in one line.
In [159]: df = pd.DataFrame({"y":np.arange(1,11),"x":[0,1,0,0,0,1,1,2,2,1]})
In [160]: df
Out[160]:
x y
0 0 1
1 1 2
2 0 3
3 0 4
4 0 5
5 1 6
6 1 7
7 2 8
8 2 9
9 1 10
In [161]: pd.pivot_table(df,values='y',index='x',aggfunc=sum)
Out[161]:
y
x
0 13
1 25
2 17
You can tell pivot_table to use specific columns as indices and values, and you get a new DataFrame object back. When you specify sum as the aggregation function, the results are identical to Matlab's accumarray.
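Equivalently (a small variant, not from the answer), a plain groupby gives the same sums:

df.groupby('x')['y'].sum()   # x: 0 -> 13, 1 -> 25, 2 -> 17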