Remove elements from one array if present in another array, keep duplicates - NumPy / Python

前端未结


I have two arrays, A (length 3.8 million) and B (length 20k). For the minimal example, let's take this case:

A = np.array([1,1,2,3,3,


                      
3 Answers

有刺的猬
2020-12-17 21:47

Adding to Divakar's answer (the searchsorted approach) -
if the original array A has a wider range than B, that approach raises an 'index out of bounds' error. See:
A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])

A[B[np.searchsorted(B,A)] !=  A]
>> IndexError: index 3 is out of bounds for axis 0 with size 3


This happens because np.searchsorted returns index 3 (one past the last valid index of B) as the insertion position for the elements 10, 12 and 14 of A in this example. Indexing B with that value, as in B[np.searchsorted(B,A)], raises the IndexError.
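To see the out-of-range index directly, a quick check along these lines (reusing the A and B from the example above) prints the insertion positions that np.searchsorted returns:

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([1, 2, 8])

idx = np.searchsorted(B, A)      # default side='left'
# Positions equal to len(B) are one past the last valid index of B
out_of_range = idx == len(B)

print(idx)
print(A[out_of_range])           # elements of A larger than every element of B
```

Only the last three positions equal len(B), and they belong to 10, 12 and 14.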
To circumvent that, a possible approach is:
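import numpy as np

def subset_sorted_array(A,B):
    # Assumes both A and B are sorted in ascending order
    Aa = A[np.where(A <= np.max(B))]
    Bb = (B[np.searchsorted(B,Aa)] !=  Aa)
    # np.pad takes the keyword 'mode' (not 'method')
    Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
    return A[Bb]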

Which works as follows:
# Take only the elements of A that fall within the range of B
Aa = A[np.where(A <= np.max(B))]

# Build the filter for those elements, then pad it with True values for the
# remaining elements of A - I split this in two operations for easier reading
Bb = (B[np.searchsorted(B,Aa)] !=  Aa)
Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)

# Then you can filter A by Bb
A[Bb]
# For the input arrays above:
>> array([ 3,  3,  3,  4,  5,  6,  7, 10, 12, 14])

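Note that this relies on both A and B being sorted in ascending order, since the True values are padded onto the end of the filter. The same idea also carries over to arrays of strings and other types (all types for which the comparison operator <= is defined).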
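If A is not guaranteed to be sorted, padding at the end no longer lines up with the out-of-range elements. A minimal sketch of a mask-based variant follows. The function name remove_present is hypothetical, and B is assumed to be sorted and non-empty:

```python
import numpy as np

def remove_present(A, B):
    """Hypothetical helper: drop elements of A found in sorted B, keeping duplicates."""
    keep = np.ones(A.shape[0], dtype=bool)   # out-of-range elements are kept by default
    in_range = A <= B[-1]                    # B is sorted, so B[-1] is its maximum
    idx = np.searchsorted(B, A[in_range])    # always < len(B) for in-range elements
    keep[in_range] = B[idx] != A[in_range]
    return A[keep]

A = np.array([12, 1, 8, 3, 14, 3, 2, 10, 1])   # deliberately unsorted
B = np.array([1, 2, 8])
print(remove_present(A, B))
```

The mask is written back at the original positions of the in-range elements, so the order of A is preserved.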
                                                                        
                                                        
            
            
              
                
情书的邮戳
2020-12-17 21:52

I am not very familiar with numpy, but how about using sets:

C = set(A.flat) - set(B.flat)


EDIT: as pointed out in the comments, sets cannot hold duplicate values, so this approach drops the duplicates in A.

So another solution would be to use filter with a lambda expression:

C = np.array(list(filter(lambda x: x not in B, A)))
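The filter above scans B once for every element of A, which may be slow for arrays of the sizes mentioned in the question. A possible sketch that keeps the same plain-Python idea but uses a set for the membership test (the sample arrays are made up for illustration):

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])

b_set = set(B.tolist())          # hash-based membership test
# Iterate over A itself, so duplicates of the kept values are preserved
C = np.array([x for x in A.tolist() if x not in b_set])
print(C)
```

Only B is turned into a set here, so A keeps its duplicates and its order.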

                                                                        
                                                        
            
            
              
                
一生所求
2020-12-17 21:53

Using searchsorted
With sorted B, we can use searchsorted -
A[B[np.searchsorted(B,A)] !=  A]

From the docs, searchsorted(a,v) finds the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved. So, let's say idx = searchsorted(B,A) and we index into B with those: B[idx]. This gives a mapped version of B with one entry for every element in A. Comparing this mapped version against A tells us, for every element in A, whether there is a match in B. Finally, we index into A with the resulting mask to select the non-matching elements.
Generic case (B is not sorted):
If B is not already sorted, which is a prerequisite, sort it first and then use the proposed method.
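The mapping described above can be traced step by step on a small input. This is only an illustrative sketch. The sample arrays are assumptions, chosen so that no element of A exceeds the largest element of B:

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])          # must be sorted

idx = np.searchsorted(B, A)      # insertion position in B for each element of A
mapped = B[idx]                  # element of B found at that position
mask = mapped != A               # True where the element of A has no match in B
out = A[mask]

print(idx)
print(mapped)
print(out)
```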
Alternatively, we can use sorter argument with searchsorted -
sidx = B.argsort()
out = A[B[sidx[np.searchsorted(B,A,sorter=sidx)]] != A]

More generic case (A has values higher than the ones in B):
sidx = B.argsort()
idx = np.searchsorted(B,A,sorter=sidx)
idx[idx==len(B)] = 0
out = A[B[sidx[idx]] != A]
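Wrapped as a function, the more generic case might look like the sketch below. The name remove_matches is hypothetical and B is assumed to be non-empty. Out-of-range positions are reset to 0, which is safe because any such element of A is larger than every element of B and so cannot equal the smallest one:

```python
import numpy as np

def remove_matches(A, B):
    """Hypothetical wrapper: B may be unsorted, A may exceed B's range."""
    sidx = B.argsort()                          # order that would sort B
    idx = np.searchsorted(B, A, sorter=sidx)    # positions in the sorted view of B
    idx[idx == len(B)] = 0                      # out-of-range -> compare with min(B)
    return A[B[sidx[idx]] != A]

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([8, 1, 2])                         # deliberately unsorted
print(remove_matches(A, B))
```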


Using in1d/isin
We can also use np.in1d, which is pretty straightforward (the docs should help clarify): it looks for a match in B for every element in A, and we can then use boolean indexing with an inverted mask to select the non-matching ones -
A[~np.in1d(A,B)]

Same with isin (in1d is deprecated in favor of isin as of NumPy 2.0) -
A[~np.isin(A,B)]

With invert flag -
A[np.in1d(A,B,invert=True)]

A[np.isin(A,B,invert=True)]

This handles the generic case, where B is not necessarily sorted.
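As a sanity check, the isin and searchsorted approaches can be compared on random data. The sketch below uses arbitrary sizes and a fixed seed, both chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 1000, size=10_000)
B = rng.permutation(1200)[:50]          # unsorted, unique values

via_isin = A[~np.isin(A, B)]

sidx = B.argsort()
idx = np.searchsorted(B, A, sorter=sidx)
idx[idx == len(B)] = 0
via_searchsorted = A[B[sidx[idx]] != A]

# Both approaches should select exactly the same elements, in the same order
print(np.array_equal(via_isin, via_searchsorted))
```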
                                                                        
                                                        
            
            
              
                