Removing duplicates from a list of NumPy arrays

Submitted by 喜夏-厌秋 on 2019-12-29 08:44:11

Question


I have an ordinary Python list that contains (multidimensional) NumPy arrays, all of the same shape and with the same number of values. Some of the arrays in the list are duplicates of earlier ones.

I want to remove all the duplicates, but the fact that the elements are NumPy arrays complicates this a bit:

• I can't use set(), as NumPy arrays are not hashable.
• I can't check for duplicates during insertion, as the arrays are generated in batches by a function and added to the list with .extend().
• NumPy arrays aren't directly comparable without resorting to one of NumPy's own functions, so I can't just write something like "if x in list"...
• The contents of the list need to remain NumPy arrays at the end of the process; I could compare copies of the arrays converted to nested lists, but I can't convert the arrays to plain Python lists permanently.

Any suggestions on how I can remove duplicates efficiently here?


Answer 1:


Using the solutions from Most efficient property to hash for numpy array, we see that hashing works best with a.tostring() if a is a NumPy array (in newer NumPy versions, tostring() is deprecated in favor of the equivalent tobytes()). So:

import numpy as np
arraylist = [np.array([1,2,3,4]), np.array([1,2,3,4]), np.array([1,3,2,4])]
L = {array.tostring(): array for array in arraylist}  # use array.tobytes() on newer NumPy
list(L.values())  # [array([1, 2, 3, 4]), array([1, 3, 2, 4])]
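One caveat with byte-based keys: two arrays with the same raw bytes but different shapes or dtypes would collide. A hedged sketch that folds shape and dtype into the key and keeps first occurrences (the helper name is my own, not from the answer):

```python
import numpy as np

def dedupe_preserving_order(arrays):
    """Remove duplicate arrays, keeping the first occurrence of each.

    Raw bytes alone can collide for arrays that differ only in shape
    or dtype, so both are folded into the dictionary key. Dicts keep
    insertion order in Python 3.7+, so first occurrences survive.
    """
    seen = {}
    for a in arrays:
        key = (a.shape, a.dtype.str, a.tobytes())
        if key not in seen:
            seen[key] = a
    return list(seen.values())

arrays = [np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4])]
unique = dedupe_preserving_order(arrays)
# unique == [array([1, 2, 3, 4]), array([1, 3, 2, 4])]
```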



Answer 2:


Depending on the structure of your data, it may be quicker to directly compare all the arrays rather than finding some way to hash them. The algorithm is O(n^2), but each individual comparison will be much quicker than creating strings or Python lists from your arrays. So it depends how many arrays you have to check.

e.g.

import numpy

uniques = []
for arr in possible_duplicates:
    if not any(numpy.array_equal(arr, unique_arr) for unique_arr in uniques):
        uniques.append(arr)
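To connect this to the batch workflow described in the question, a small self-contained sketch (the batch-generating function here is hypothetical, standing in for the question's unnamed generator) that collects batches with .extend() and de-duplicates once at the end:

```python
import numpy as np

def make_batch(seed):
    # Hypothetical stand-in for the question's batch-generating function:
    # produces three small 0/1 arrays of identical shape, so duplicates
    # across batches are likely.
    rng = np.random.default_rng(seed)
    return [rng.integers(0, 2, size=(2, 2)) for _ in range(3)]

collected = []
for seed in range(3):
    collected.extend(make_batch(seed))

# O(n^2) de-dup pass over the collected list, as in the answer above
uniques = []
for arr in collected:
    if not any(np.array_equal(arr, u) for u in uniques):
        uniques.append(arr)
```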



Answer 3:


Here is one way using tuple:

>>> import numpy as np
>>> t = [np.asarray([1, 2, 3, 4]), 
         np.asarray([1, 2, 3, 4]), 
         np.asarray([1, 1, 3, 4])]

>>> list(map(np.asarray, set(map(tuple, t))))  # set order is arbitrary
[array([1, 1, 3, 4]), array([1, 2, 3, 4])]

If your arrays are multidimensional, then first flatten them to a 1-by-whatever array, then use the same idea, and reshape them at the end:

def to_tuple(arr):
    return tuple(arr.reshape((arr.size,)))

def from_tuple(tup, original_shape):
    return np.asarray(tup).reshape(original_shape)

Example:

In [64]: t = np.asarray([[[1,2,3],[4,5,6]],
                         [[1,1,3],[4,5,6]],
                         [[1,2,3],[4,5,6]]])

In [65]: list(map(lambda x: from_tuple(x, t[0].shape), set(map(to_tuple, t))))
Out[65]: 
[array([[1, 2, 3],
        [4, 5, 6]]), 
 array([[1, 1, 3],
        [4, 5, 6]])]

Another option is to create a pandas.DataFrame out of your ndarrays (treating them as rows by reshaping if needed) and using the pandas built-ins for uniquifying the rows.

In [34]: t
Out[34]: [array([1, 2, 3, 4]), array([1, 2, 3, 4]), array([1, 1, 3, 4])]

In [35]: pandas.DataFrame(t).drop_duplicates().values
Out[35]: 
array([[1, 2, 3, 4],
       [1, 1, 3, 4]])
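For multidimensional arrays, the same pandas approach works via the reshape-to-rows idea mentioned above; a sketch (variable names are my own) that flattens each array to one DataFrame row, drops duplicate rows, and restores the original shape:

```python
import numpy as np
import pandas as pd

t = [np.array([[1, 2, 3], [4, 5, 6]]),
     np.array([[1, 1, 3], [4, 5, 6]]),
     np.array([[1, 2, 3], [4, 5, 6]])]

shape = t[0].shape
# Flatten each array to one row, drop duplicate rows (first occurrence
# is kept, in original order), then restore the shape.
flat = pd.DataFrame([a.ravel() for a in t])
unique = [row.reshape(shape) for row in flat.drop_duplicates().values]
# unique holds the two distinct 2x3 arrays, in first-seen order
```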

Overall, I think it's a bad idea to use tostring() as a quasi-hash function, because you'll need more boilerplate code than in my approach just to protect against the possibility that some of the contents are mutated after they've been assigned their "hash" key in some dict.

If the reshaping and converting to tuple is too slow given the size of the data, my feeling is that this reveals a more fundamental problem: the application isn't designed well around its needs (like de-duping), and trying to cram them into some in-memory Python process is probably not the right way. At that point, I would stop and consider whether something like Cassandra, which can easily build database indices over large columns (or multidimensional arrays) of floating-point (or other) data, isn't the saner approach.



Source: https://stackoverflow.com/questions/27751072/removing-duplicates-from-a-list-of-numpy-arrays
