Find unique rows in numpy.array

独厮守ぢ 2020-11-21 10:57

I need to find unique rows in a numpy.array.

For example:

>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
         


        
20 Answers
  • 2020-11-21 11:47

    If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.

    The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.

    As a quick example:

    import numpy as np
    
    data = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
    
    ncols = data.shape[1]
    dtype = data.dtype.descr * ncols
    struct = data.view(dtype)
    
    uniq = np.unique(struct)
    uniq = uniq.view(data.dtype).reshape(-1, ncols)
    print(uniq)
    

    To understand what's going on, have a look at the intermediary results.

    Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)

    In [71]: struct
    Out[71]:
    array([[(1, 1, 1, 0, 0, 0)],
           [(0, 1, 1, 1, 0, 0)],
           [(0, 1, 1, 1, 0, 0)],
           [(1, 1, 1, 0, 0, 0)],
           [(1, 1, 1, 1, 1, 0)]],
          dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
    
    In [72]: struct[0]
    Out[72]:
    array([(1, 1, 1, 0, 0, 0)],
          dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
    

    Once we run numpy.unique, we'll get a structured array back:

    In [73]: np.unique(struct)
    Out[73]:
    array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
          dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
    

    We then need to view that as a "normal" array (_ stores the result of the last calculation in IPython, which is why you're seeing _.view...):

    In [74]: _.view(data.dtype)
    Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])
    

    And then reshape it back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, given the number of columns):

    In [75]: _.reshape(-1, ncols)
    Out[75]:
    array([[0, 1, 1, 1, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0]])
    

    Obviously, if you wanted to be more concise, you could write it as:

    import numpy as np
    
    def unique_rows(data):
        uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
        return uniq.view(data.dtype).reshape(-1, data.shape[1])
    
    data = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
    print(unique_rows(data))
    

    Which results in:

    [[0 1 1 1 0 0]
     [1 1 1 0 0 0]
     [1 1 1 1 1 0]]
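
    The claim that the structured view makes no copy can be checked directly. The following is only a minimal sketch: the names row_dtype and shares are illustrative, and it assumes the input is C-contiguous (which holds for an array freshly built with np.array).

```python
import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

ncols = data.shape[1]
# One unnamed field per column; NumPy names them f0, f1, ... automatically.
row_dtype = np.dtype([('', data.dtype)] * ncols)
struct = data.view(row_dtype)  # shape (5, 1): one record per original row

# The view reuses the original buffer instead of copying it.
shares = np.shares_memory(data, struct)

# np.unique flattens the (5, 1) view and compares whole records.
uniq = np.unique(struct).view(data.dtype).reshape(-1, ncols)
print(shares)
print(uniq)
```

    If the array is not contiguous (a column slice, for instance), the view step may fail, so passing the input through np.ascontiguousarray first is a reasonable precaution.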
    
  • 2020-11-21 11:49

    We can turn an m x n numeric numpy array into a numpy array of m strings, one per row, and run np.unique on that. The following function also returns the inverse indices and the counts, just like numpy.unique:

    import numpy as np
    
    def uniqueRow(a):
        # Join each row of an m x n numpy array into a single '-'-separated
        # string, so that np.unique can compare whole rows.

        # Input: an m x n numpy array (a)
        # Output: unique m' x n numpy array, inverse indices, and counts

        b = a.astype(str)  # np.str was removed from NumPy; use the builtin str

        # Build one string key per row, column by column.
        s2 = b[:, 0]
        for i in range(1, a.shape[1]):
            s2 = np.char.add(np.char.add(s2, '-'), b[:, i])

        _, idx, inv_, c = np.unique(s2, return_index=True, return_inverse=True, return_counts=True)

        return a[idx], inv_, c
    

    Example:

    A = np.array([[3.17, 9.502, 3.291],
                  [9.984, 2.773, 6.852],
                  [1.172, 8.885, 4.258],
                  [9.73, 7.518, 3.227],
                  [8.113, 9.563, 9.117],
                  [9.984, 2.773, 6.852],
                  [9.73, 7.518, 3.227]])
    
    B, inv_, c = uniqueRow(A)
    
    Results:
    
    B:
    [[ 1.172  8.885  4.258]
    [ 3.17   9.502  3.291]
    [ 8.113  9.563  9.117]
    [ 9.73   7.518  3.227]
    [ 9.984  2.773  6.852]]
    
    inv_:
    [1 4 0 3 2 4 3]
    
    c:
    [1 1 1 2 2]
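
    To show what the inverse indices and counts are for, here is a minimal sketch. The helper unique_rows_via_strings is hypothetical (a compact stand-in for the function above, building the keys with a plain Python join), and the three-row input is made up for illustration.

```python
import numpy as np

def unique_rows_via_strings(a):
    # Hypothetical helper: one '-'-joined string key per row.
    keys = np.array(['-'.join(map(str, row)) for row in a])
    _, idx, inv, counts = np.unique(
        keys, return_index=True, return_inverse=True, return_counts=True)
    return a[idx], inv, counts

A = np.array([[3.17, 9.502, 3.291],
              [9.984, 2.773, 6.852],
              [3.17, 9.502, 3.291]])

B, inv, counts = unique_rows_via_strings(A)

# Indexing the unique rows with the inverse indices rebuilds the input,
# and the counts add up to the number of input rows.
rebuilt = B[inv]
print(np.array_equal(rebuilt, A), counts.sum() == len(A))
```

    Note that the keys are compared as strings, so the unique rows come back in lexicographic string order, which can differ from numeric order (for example, '10' sorts before '9').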
    
  • 2020-11-21 11:52

    For the general case of 3D or higher-dimensional nested arrays, try this:

    import numpy as np
    
    def unique_nested_arrays(ar):
        origin_shape = ar.shape
        origin_dtype = ar.dtype
        ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
        ar = np.ascontiguousarray(ar)
        unique_ar = np.unique(ar.view([('', origin_dtype)]*np.prod(origin_shape[1:])))
        return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0], ) + origin_shape[1:])
    

    which satisfies your 2D dataset:

    a = np.array([[1, 1, 1, 0, 0, 0],
           [0, 1, 1, 1, 0, 0],
           [0, 1, 1, 1, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0]])
    unique_nested_arrays(a)
    

    gives:

    array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
    

    But also 3D arrays like:

    b = np.array([[[1, 1, 1], [0, 1, 1]],
                  [[0, 1, 1], [1, 1, 1]],
                  [[1, 1, 1], [0, 1, 1]],
                  [[1, 1, 1], [1, 1, 1]]])
    unique_nested_arrays(b)
    

    gives:

    array([[[0, 1, 1], [1, 1, 1]],
       [[1, 1, 1], [0, 1, 1]],
       [[1, 1, 1], [1, 1, 1]]])
    
  • 2020-11-21 11:54

    When I run np.unique on np.random.random(100).reshape(10,10), it returns all the unique individual elements. You want the unique rows, so first you need to turn the rows into tuples:

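    import numpy as np

    array = np.array([[1, 0], [1, 0], [0, 1]])  # replace with your 2D numpy array
    new_array = [tuple(row) for row in array]
    # np.unique(new_array) would flatten the tuples again, so deduplicate with a set
    uniques = np.array(sorted(set(new_array)))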
    

    Changing the types like this is the only way I can see to do what you want, and I am not sure whether iterating over the rows to build the tuples is compatible with your "not looping through" requirement.

  • 2020-11-21 11:55

    An alternative to structured arrays is a view with a void type that joins the whole row into a single item:

    import numpy as np

    a = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0, 0],
                  [1, 1, 1, 1, 1, 0]])
    
    b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, idx = np.unique(b, return_index=True)
    
    unique_a = a[idx]
    
    >>> unique_a
    array([[0, 1, 1, 1, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0]])
    

    EDIT Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.

    EDIT The above can be slightly sped up, perhaps at the cost of clarity, by doing:

    unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])
    

    Also, at least on my system, performance-wise it is on par with, or even better than, the lexsort method:

    a = np.random.randint(2, size=(10000, 6))
    
    %timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
    100 loops, best of 3: 3.17 ms per loop
    
    %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
    100 loops, best of 3: 5.93 ms per loop
    
    a = np.random.randint(2, size=(10000, 100))
    
    %timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
    10 loops, best of 3: 29.9 ms per loop
    
    %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
    10 loops, best of 3: 116 ms per loop
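
    The void-view approach can be wrapped in a small function. This is only a sketch under stated assumptions: the helper name unique_rows_void and its keep_order flag are hypothetical, not part of the answer above. It uses return_index so that the rows can optionally be returned in order of first appearance rather than in the byte-wise sorted order that np.unique produces for void items.

```python
import numpy as np

def unique_rows_void(a, keep_order=False):
    # Hypothetical helper built on the void-view trick.
    a = np.ascontiguousarray(a)  # copies only if `a` is not already contiguous
    row_bytes = a.dtype.itemsize * a.shape[1]
    b = a.view(np.dtype((np.void, row_bytes)))  # shape (m, 1): one blob per row
    _, idx = np.unique(b, return_index=True)    # index of each first occurrence
    if keep_order:
        idx = np.sort(idx)  # restore the original row order
    return a[idx]

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

in_order = unique_rows_void(a, keep_order=True)
# A column slice is not contiguous; ascontiguousarray makes the view legal.
sliced = unique_rows_void(a[:, ::2], keep_order=True)
print(in_order)
print(sliced)
```

    Because rows are compared as raw bytes, values that are numerically equal but stored differently (such as 0.0 and -0.0 in a float array) would count as distinct rows.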
    
  • 2020-11-21 11:56

    Here is another variation on @Greg's Pythonic answer:

    # np.vstack needs a sequence, so convert the set to a list first
    np.vstack(list(set(map(tuple, a))))
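
    A set does not guarantee any particular row order. If the order of first appearance matters, one possible variant (a sketch, not the answer's own method) replaces the set with dict.fromkeys, relying on dicts preserving insertion order in Python 3.7 and later:

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Dict keys are unique and keep the order in which they were first inserted.
rows = list(dict.fromkeys(map(tuple, a)))
unique_a = np.array(rows)
print(unique_a)
```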
    