Removing elements from an array that are in another array

后端 未结 5 803
抹茶落季
抹茶落季 2020-11-30 04:06

Say I have these 2D arrays A and B.

How can I remove elements from A that are in B. (Complement in set theory: A-B)

A=np.asarray([[1,1,1], [1,1,2], [         


        
相关标签:
5条回答
  • 2020-11-30 04:34

    Based on this solution to Find the row indexes of several values in a numpy array, here's a NumPy based solution with less memory footprint and could be beneficial when working with large arrays -

    dims = np.maximum(B.max(0),A.max(0))+1
    out = A[~np.in1d(np.ravel_multi_index(A.T,dims),np.ravel_multi_index(B.T,dims))]
    

    Sample run -

    In [38]: A
    Out[38]: 
    array([[1, 1, 1],
           [1, 1, 2],
           [1, 1, 3],
           [1, 1, 4]])
    
    In [39]: B
    Out[39]: 
    array([[0, 0, 0],
           [1, 0, 2],
           [1, 0, 3],
           [1, 0, 4],
           [1, 1, 0],
           [1, 1, 1],
           [1, 1, 4]])
    
    In [40]: out
    Out[40]: 
    array([[1, 1, 2],
           [1, 1, 3]])
    

    Runtime test on large arrays -

    In [107]: def in1d_approach(A,B):
         ...:     dims = np.maximum(B.max(0),A.max(0))+1
         ...:     return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
         ...:                     np.ravel_multi_index(B.T,dims))]
         ...: 
    
    In [108]: # Setup arrays with B as large array and A contains some of B's rows
         ...: B = np.random.randint(0,9,(1000,3))
         ...: A = np.random.randint(0,9,(100,3))
         ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
         ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
         ...: A[A_idx] = B[B_idx]
         ...: 
    

    Timings with broadcasting based solutions -

    In [109]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
    100 loops, best of 3: 4.64 ms per loop # @Kasramvd's soln
    
    In [110]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
    100 loops, best of 3: 3.66 ms per loop
    

    Timing with less memory footprint based solution -

    In [111]: %timeit in1d_approach(A,B)
    1000 loops, best of 3: 231 µs per loop
    

    Further performance boost

    in1d_approach reduces each row by considering each row as an indexing tuple. We can do the same a bit more efficiently by introducing matrix-multiplication with np.dot, like so -

    def in1d_dot_approach(A,B):
        cumdims = (np.maximum(A.max(),B.max())+1)**np.arange(B.shape[1])
        return A[~np.in1d(A.dot(cumdims),B.dot(cumdims))]
    

    Let's test it against the previous on much larger arrays -

    In [251]: # Setup arrays with B as large array and A contains some of B's rows
         ...: B = np.random.randint(0,9,(10000,3))
         ...: A = np.random.randint(0,9,(1000,3))
         ...: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
         ...: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
         ...: A[A_idx] = B[B_idx]
         ...: 
    
    In [252]: %timeit in1d_approach(A,B)
    1000 loops, best of 3: 1.28 ms per loop
    
    In [253]: %timeit in1d_dot_approach(A, B)
    1000 loops, best of 3: 1.2 ms per loop
    
    0 讨论(0)
  • 2020-11-30 04:40

    there is a easy solution with list comprehension,

    A = [i for i in A if i not in B]
    

    Result

    [[1, 1, 2], [1, 1, 3]]
    

    List comprehension it's not removing the elements from the array, It's just reassigning,

    if you want to remove the elements use this method

    for i in B:
         if i in A:
         A.remove(i)
    
    0 讨论(0)
  • 2020-11-30 04:43

    If you want to do it the numpy way,

    import numpy as np
    
    A = np.array([[1, 1, 1,], [1, 1, 2], [1, 1, 3], [1, 1, 4]])
    B = np.array([[0, 0, 0], [1, 0, 2], [1, 0, 3], [1, 0, 4], [1, 1, 0], [1, 1, 1], [1, 1, 4]])
    A_rows = A.view([('', A.dtype)] * A.shape[1])
    B_rows = B.view([('', B.dtype)] * B.shape[1])
    
    diff_array = np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])
    

    As @Rahul suggested, for a non numpy easy solution,

    diff_array = [i for i in A if i not in B]
    
    0 讨论(0)
  • 2020-11-30 04:44

    Another non-numpy solution:

    [i for i in A if i not in B]
    
    0 讨论(0)
  • 2020-11-30 04:50

    Here is a Numpythonic approach with broadcasting:

    In [83]: A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
    Out[83]: 
    array([[1, 1, 2],
           [1, 1, 3]])
    

    Here is a timeit with other answer:

    In [90]: def cal_diff(A, B):
       ....:     A_rows = A.view([('', A.dtype)] * A.shape[1])
       ....:     B_rows = B.view([('', B.dtype)] * B.shape[1])
       ....:     return np.setdiff1d(A_rows, B_rows).view(A.dtype).reshape(-1, A.shape[1])
       ....: 
    
    In [93]: %timeit cal_diff(A, B)
    10000 loops, best of 3: 54.1 µs per loop
    
    In [94]: %timeit A[np.all(np.any((A-B[:, None]), axis=2), axis=0)]
    100000 loops, best of 3: 9.41 µs per loop
    
    # Even better with Divakar's suggestion
    In [97]: %timeit A[~((A[:,None,:] == B).all(-1)).any(1)]
    100000 loops, best of 3: 7.41 µs per loop
    

    Well, if you are looking for a faster way you should looking for ways that reduce the number of comparisons. In this case (without considering the order) you can generate a unique number from your rows and compare the numbers which can be done with summing the items power of two.

    Here is the benchmark with Divakar's in1d approach:

    In [144]: def in1d_approach(A,B):
       .....:         dims = np.maximum(B.max(0),A.max(0))+1
       .....:         return A[~np.in1d(np.ravel_multi_index(A.T,dims),\
       .....:                          np.ravel_multi_index(B.T,dims))]
       .....: 
    
    In [146]: %timeit in1d_approach(A, B)
    10000 loops, best of 3: 23.8 µs per loop
    
    In [145]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
    10000 loops, best of 3: 20.2 µs per loop
    

    You can use np.diff to get the an order independent result:

    In [194]: B=np.array([[0, 0, 0,], [1, 0, 2,], [1, 0, 3,], [1, 0, 4,], [1, 1, 0,], [1, 1, 1,], [1, 1, 4,], [4, 1, 1]])
    
    In [195]: A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
    Out[195]: 
    array([[1, 1, 2],
           [1, 1, 3]])
    
    In [196]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
    10000 loops, best of 3: 30.7 µs per loop
    

    Benchmark with Divakar's setup:

    In [198]: B = np.random.randint(0,9,(1000,3))
    
    In [199]: A = np.random.randint(0,9,(100,3))
    
    In [200]: A_idx = np.random.choice(np.arange(A.shape[0]),size=10,replace=0)
    
    In [201]: B_idx = np.random.choice(np.arange(B.shape[0]),size=10,replace=0)
    
    In [202]: A[A_idx] = B[B_idx]
    
    In [203]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
    10000 loops, best of 3: 137 µs per loop
    
    In [204]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
    10000 loops, best of 3: 112 µs per loop
    
    In [205]: %timeit in1d_approach(A, B)
    10000 loops, best of 3: 115 µs per loop
    

    Timing with larger arrays (Divakar's solution is slightly faster):

    In [231]: %timeit A[~np.in1d(np.diff(np.diff(np.power(A, 2))), np.diff(np.diff(np.power(B, 2))))]
    1000 loops, best of 3: 1.01 ms per loop
    
    In [232]: %timeit A[~np.in1d(np.power(A, 2).sum(1), np.power(B, 2).sum(1))]
    1000 loops, best of 3: 880 µs per loop
    
    In [233]:  %timeit in1d_approach(A, B)
    1000 loops, best of 3: 807 µs per loop
    
    0 讨论(0)
提交回复
热议问题