I have two arrays A (len of 3.8 million) and B (len of 20k).
For a minimal example, let's take this case:
A = np.array([1,1,2,3,3,
Adding to Divakar's answer above: if the original array A contains values larger than the maximum of B, that approach raises an 'index out of bounds' error. See:
import numpy as np
A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])
A[B[np.searchsorted(B,A)] != A]
>> IndexError: index 3 is out of bounds for axis 0 with size 3
This happens because np.searchsorted returns index 3 (one past the last valid index of B) as the insertion position in B for the elements 10, 12 and 14 from A, in this example. Indexing B with that value in B[np.searchsorted(B,A)] then raises the IndexError.
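To see the out-of-range index directly, it can help to print what np.searchsorted returns for the arrays above. This is only a small check using the A and B from this example:

```python
import numpy as np

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])

# Insertion position in the sorted array B for each element of A
idx = np.searchsorted(B, A)
print(idx)

# The largest position equals len(B), which is not a valid index into B
print(idx.max(), len(B))
```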
To circumvent that, a possible approach is:
def subset_sorted_array(A,B):
    # Assumes A and B are both sorted in ascending order
    Aa = A[np.where(A <= np.max(B))]
    Bb = (B[np.searchsorted(B,Aa)] != Aa)
    Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
    return A[Bb]
Which works as follows:
# Take only the elements of A that do not exceed max(B), so that
# np.searchsorted returns a valid index into B for each of them
Aa = A[np.where(A <= np.max(B))]
# Build the filter (True where an element of Aa is not in B), then pad it
# with 'Trues' for the dropped tail of A - I split this in two operations for
# easier reading
Bb = (B[np.searchsorted(B,Aa)] != Aa)
Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
# Then you can filter A by Bb
A[Bb]
# For the input arrays above:
>> array([ 3, 3, 3, 4, 5, 6, 7, 10, 12, 14])
Note that this will also work between arrays of strings and other types (all types for which the comparison operator <= is defined).
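As an alternative sketch (not part of the approach above), the out-of-range positions can be clipped to the last valid index of B instead of trimming and padding. The function name not_in_sorted is made up for this illustration, and the sketch assumes B is sorted and non-empty. A does not need to be sorted here, because no padding at the tail is involved.

```python
import numpy as np

def not_in_sorted(A, B):
    # Hypothetical helper: return the elements of A that are not in B.
    # Assumes B is sorted in ascending order and has at least one element.
    idx = np.searchsorted(B, A)
    # Positions equal to len(B) belong to values larger than max(B);
    # clipping points them at the last element of B, which cannot match.
    idx = np.clip(idx, 0, len(B) - 1)
    return A[B[idx] != A]

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])
print(not_in_sorted(A, B))
```

For the input arrays above this keeps the same elements as subset_sorted_array, and the result can be cross-checked against A[np.isin(A, B, invert=True)].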