I need to find unique rows in a numpy.array
.
For example:
>>> a # I have
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:
ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]
This method does not use tuples, and should be much faster and simpler than other methods given here.
NOTE: A previous version of this did not have the ind right after a[, which mean that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:
b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]
This is faster and uses less memory.
Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:
b = a[lexsort(a.reshape((a.shape[0],-1)).T)];
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]
An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.
Edit:
To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster:
In [87]: %timeit unique(a.view(dtype)).view('
With a larger a, however, this version ends up being much, much faster:
In [96]: a = np.random.randint(0,2,size=(10000,6))
In [97]: %timeit unique(a.view(dtype)).view('