Removing duplicates from a list of NumPy arrays

Submitted by 喜夏-厌秋 on 2019-12-29 08:44:11

Question


I have an ordinary Python list that contains (multidimensional) NumPy arrays, all of the same shape and with the same number of values. Some of the arrays in the list are duplicates of earlier ones.

I want to remove all the duplicates, but the fact that the elements are NumPy arrays complicates this a bit:

• I can't use set(), as NumPy arrays are not hashable.
• I can't check for duplicates during insertion, as the arrays are generated in batches by a function and added to the list with .extend().
• NumPy arrays aren't directly comparable without resorting to one of NumPy's own functions, so I can't just write something like "if x in list"...
• The contents of the list need to remain NumPy arrays at the end of the process; I could compare copies of the arrays converted to nested lists, but I can't convert the arrays to plain Python lists permanently.

Any suggestions on how I can remove duplicates efficiently here?


Answer 1:


Using the solutions from Most efficient property to hash for numpy array, we see that hashing works best with a.tostring() if a is a NumPy array (in newer NumPy versions, tostring() is deprecated in favor of the equivalent tobytes()). So:

import numpy as np
arraylist = [np.array([1,2,3,4]), np.array([1,2,3,4]), np.array([1,3,2,4])]
L = {array.tostring(): array for array in arraylist}  # use array.tobytes() on newer NumPy
list(L.values())  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]
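One caveat with byte-based keys: two arrays with the same raw bytes but different shapes or dtypes would collide. A hedged sketch that folds shape and dtype into the key and keeps first occurrences (the helper name is my own, not from the answer):

```python
import numpy as np

def dedupe_preserving_order(arrays):
    """Remove duplicate arrays, keeping the first occurrence of each.

    Raw bytes alone can collide for arrays that differ only in shape
    or dtype, so both are folded into the dictionary key. Dicts keep
    insertion order in Python 3.7+, so first occurrences survive.
    """
    seen = {}
    for a in arrays:
        key = (a.shape, a.dtype.str, a.tobytes())
        if key not in seen:
            seen[key] = a
    return list(seen.values())

arrays = [np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4])]
unique = dedupe_preserving_order(arrays)
# unique == [array([1, 2, 3, 4]), array([1, 3, 2, 4])]
```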



Answer 2:


Depending on the structure of your data, it may be quicker to directly compare all the arrays rather than finding some way to hash them. The algorithm is O(n^2), but each individual comparison will be much quicker than creating strings or Python lists from your arrays. So it depends how many arrays you have to check.

e.g.

import numpy

uniques = []
for arr in possible_duplicates:
    if not any(numpy.array_equal(arr, unique_arr) for unique_arr in uniques):
        uniques.append(arr)
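To connect this to the batch workflow described in the question, a small self-contained sketch (the batch-generating function here is hypothetical, standing in for the question's unnamed generator) that collects batches with .extend() and de-duplicates once at the end:

```python
import numpy as np

def make_batch(seed):
    # Hypothetical stand-in for the question's batch-generating function:
    # produces three small 0/1 arrays of identical shape, so duplicates
    # across batches are likely.
    rng = np.random.default_rng(seed)
    return [rng.integers(0, 2, size=(2, 2)) for _ in range(3)]

collected = []
for seed in range(3):
    collected.extend(make_batch(seed))

# O(n^2) de-dup pass over the collected list, as in the answer above
uniques = []
for arr in collected:
    if not any(np.array_equal(arr, u) for u in uniques):
        uniques.append(arr)
```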



Answer 3:


Here is one way using tuple:

>>> import numpy as np
>>> t = [np.asarray([1, 2, 3, 4]), 
         np.asarray([1, 2, 3, 4]), 
         np.asarray([1, 1, 3, 4])]

>>> list(map(np.asarray, set(map(tuple, t))))  # set order is arbitrary
[array([1, 1, 3, 4]), array([1, 2, 3, 4])]

If your arrays are multidimensional, then first flatten them to a 1-by-whatever array, then use the same idea, and reshape them at the end:

def to_tuple(arr):
    return tuple(arr.reshape((arr.size,)))

def from_tuple(tup, original_shape):
    return np.asarray(tup).reshape(original_shape)

Example:

In [64]: t = np.asarray([[[1,2,3],[4,5,6]],
                         [[1,1,3],[4,5,6]],
                         [[1,2,3],[4,5,6]]])

In [65]: list(map(lambda x: from_tuple(x, t[0].shape), set(map(to_tuple, t))))
Out[65]: 
[array([[1, 2, 3],
        [4, 5, 6]]), 
 array([[1, 1, 3],
        [4, 5, 6]])]

Another option is to create a pandas.DataFrame out of your ndarrays (treating them as rows by reshaping if needed) and using the pandas built-ins for uniquifying the rows.

In [34]: t
Out[34]: [array([1, 2, 3, 4]), array([1, 2, 3, 4]), array([1, 1, 3, 4])]

In [35]: pandas.DataFrame(t).drop_duplicates().values
Out[35]: 
array([[1, 2, 3, 4],
       [1, 1, 3, 4]])
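For multidimensional arrays, the same pandas approach works via the reshape-to-rows idea mentioned above; a sketch (variable names are my own) that flattens each array to one DataFrame row, drops duplicate rows, and restores the original shape:

```python
import numpy as np
import pandas as pd

t = [np.array([[1, 2, 3], [4, 5, 6]]),
     np.array([[1, 1, 3], [4, 5, 6]]),
     np.array([[1, 2, 3], [4, 5, 6]])]

shape = t[0].shape
# Flatten each array to one row, drop duplicate rows (first occurrence
# is kept, in original order), then restore the shape.
flat = pd.DataFrame([a.ravel() for a in t])
unique = [row.reshape(shape) for row in flat.drop_duplicates().values]
# unique holds the two distinct 2x3 arrays, in first-seen order
```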

Overall, I think it's a bad idea to use tostring() as a quasi-hash function, because you'll need more boilerplate code than in my approach just to protect against the possibility that some of the contents are mutated after they've been assigned their "hash" key in some dict.

If the reshaping and converting to tuple is too slow given the size of the data, my feeling is that this reveals a more fundamental problem: the application isn't designed well around its needs (like de-duping), and trying to cram them into some in-memory Python process is probably not the right way. At that point, I would stop and consider whether something like Cassandra, which can easily build database indices over large columns (or multidimensional arrays) of floating-point (or other) data, isn't the saner approach.



Source: https://stackoverflow.com/questions/27751072/removing-duplicates-from-a-list-of-numpy-arrays
