I have arrays of the following type:
a = array([[1, 1, 1],
           [1, 1, 1],
           [1, 1, 1],
           [2, 2, 2],
           [2, 2, 2],
           [2, 2, 2],
           [3, 3, 0],
           [3, 3, 0],
           [3, 3, 0]])
How can I count how many times each unique row occurs?
Since numpy 1.13.0, np.unique can be used with the axis argument:
>>> np.unique(a, axis=0, return_counts=True)
(array([[1, 1, 1],
        [2, 2, 2],
        [3, 3, 0]]), array([3, 3, 3]))
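The call returns a tuple, so the unique rows and their counts can be unpacked directly; a small usage sketch:
>>> values, counts = np.unique(a, axis=0, return_counts=True)
>>> for row, count in zip(values, counts):
...     print(row, count)
[1 1 1] 3
[2 2 2] 3
[3 3 0] 3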
collections.Counter can do this conveniently, in almost the same form as the example given.
>>> from collections import Counter
>>> c = Counter()
>>> for x in a:
...     c[tuple(x)] += 1
...
>>> c
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
This converts each row to a tuple; tuples can be used as dictionary keys because they are immutable (and therefore hashable), whereas lists are mutable and can't be used as dict keys.
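A minimal demonstration of why the tuple conversion is needed:
>>> d = {}
>>> d[[1, 1, 1]] = 1
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'
>>> d[(1, 1, 1)] = 1  # tuples are hashable, so this works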
Why do you want to avoid using for loops?
And similar to @padraic-cunningham's much cooler answer:
>>> Counter(tuple(x) for x in a)
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
>>> Counter(map(tuple, a))
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
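If you need numpy arrays rather than a Counter afterwards, the result can be unpacked back into arrays; a sketch, assuming numpy is imported as np (note the row order follows the Counter, not the original array):
>>> unq = np.array(list(c.keys()))
>>> counts = np.array(list(c.values()))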
If you don't mind mapping the rows to tuples just to get the counts, you can use a Counter dict, which runs in 28.5 µs on my machine using Python 3, well below your threshold:
In [5]: timeit Counter(map(tuple, a))
10000 loops, best of 3: 28.5 µs per loop
In [6]: c = Counter(map(tuple, a))
In [7]: c
Out[7]: Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
You could collapse each row into a single linear index by treating the row elements as multi-dimensional indices with np.ravel_multi_index. Then np.unique gives us the position of the first occurrence of each unique row and, through its optional return_counts argument, the counts as well. Thus, the implementation would look something like this -
def unique_rows_counts(a):
    # Calculate linear indices using the rows of a
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)
    # Get the first-occurrence indices of the unique values and their counts
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    # Return the unique rows from a and their respective counts
    return a[unq_idx], counts
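To see what the linear-index step computes, here's a minimal illustration (note that this trick assumes non-negative integer entries, which np.ravel_multi_index requires):
>>> b = np.array([[1, 1, 1], [2, 2, 2], [1, 1, 1]])
>>> np.ravel_multi_index(b.T, b.max(0) + 1)
array([13, 26, 13])
Each distinct row maps to a distinct scalar (here, with dims (3, 3, 3), row [1, 1, 1] gives 1*9 + 1*3 + 1 = 13), so finding unique rows reduces to finding unique integers.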
Sample run -
In [64]: a
Out[64]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2],
       [2, 2, 2],
       [3, 3, 0],
       [3, 3, 0],
       [3, 3, 0]])
In [65]: unqrows, counts = unique_rows_counts(a)
In [66]: unqrows
Out[66]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 0]])
In [67]: counts
Out[67]: array([3, 3, 3])
Assuming you are okay with either numpy arrays or collections as outputs, one can benchmark the solutions provided thus far, like so -
Function definitions:
import numpy as np
from collections import Counter

def unique_rows_counts(a):
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

def map_Counter(a):
    return Counter(map(tuple, a))

def forloop_Counter(a):
    c = Counter()
    for x in a:
        c[tuple(x)] += 1
    return c
Timings:
In [53]: a = np.random.randint(0,4,(10000,5))
In [54]: %timeit map_Counter(a)
10 loops, best of 3: 31.7 ms per loop
In [55]: %timeit forloop_Counter(a)
10 loops, best of 3: 45.4 ms per loop
In [56]: %timeit unique_rows_counts(a)
1000 loops, best of 3: 1.72 ms per loop
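As a quick sanity check that the vectorized version agrees with the Counter-based ones, one can compare the outputs; a sketch using the functions above:
In [57]: unq, cnt = unique_rows_counts(a)
In [58]: c = map_Counter(a)
In [59]: all(c[tuple(row)] == n for row, n in zip(unq, cnt))
Out[59]: True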
The numpy_indexed package (disclaimer: I am its author) contains efficient vectorized functionality for this kind of operation:
import numpy_indexed as npi
unique_rows, row_count = npi.count(a, axis=0)
Note that this works for arrays of any dimensionality or datatype.
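For the example array above, usage would look something like this (a sketch; the unique rows come back in sorted order here, matching the earlier outputs):
>>> import numpy_indexed as npi
>>> unique_rows, row_count = npi.count(a, axis=0)
>>> unique_rows
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 0]])
>>> row_count
array([3, 3, 3])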