Efficiently find row intersections of two 2-D numpy arrays

后端未结

关注

 3  929

感情败类

I am trying to figure out an efficient way of finding row intersections of two np.arrays.

Two arrays have the same shapes, and duplicate values in each

相关标签:

3条回答

忘掉有多难

2021-01-05 17:22

If you have no duplicates within a row you can try to replicate what np.intersect1d does under the hood (see the source code here):

>>> c = np.hstack((a, b))
>>> c
array([[2, 5, 6, 2, 3, 4],
       [8, 2, 3, 7, 4, 3],
       [4, 1, 5, 5, 4, 1],
       [1, 7, 9, 7, 6, 9]])
>>> c.sort(axis=1)
>>> c
array([[2, 2, 3, 4, 5, 6],
       [2, 3, 3, 4, 7, 8],
       [1, 1, 4, 4, 5, 5],
       [1, 6, 7, 7, 9, 9]])
>>> c[:, 1:] == c[:, :-1]
array([[ True, False, False, False, False],
       [False,  True, False, False, False],
       [ True, False,  True, False,  True],
       [False, False,  True, False,  True]], dtype=bool)
>>> np.sum(c[:, 1:] == c[:, :-1], axis=1)
array([1, 1, 3, 2])

0 讨论(0)

悲哀的现实

2021-01-05 17:33
I can't think of a clean pure-numpy solution, but the following suggestion should speed things up, potentially dramatically:
1. use numba. It is as simple as decorating your get_intersect1ds function with @autojit
2. pass assume_unique = True when you call intersect1d
0 讨论(0)
发布评论:

提交评论
- 加载中...
别跟我提以往

2021-01-05 17:39
This answer might not be viable, because if the input has shape (N, M), it generates an intermediate array with size (N, M, M), but it's always fun to see what you can do with broadcasting:
```
In [43]: a
Out[43]: 
array([[2, 5, 6],
       [8, 2, 3],
       [4, 1, 5],
       [1, 7, 9]])

In [44]: b
Out[44]: 
array([[2, 3, 4],
       [7, 4, 3],
       [5, 4, 1],
       [7, 6, 9]])

In [45]: (np.expand_dims(a, -1) == np.expand_dims(b, 1)).sum(axis=-1).sum(axis=-1)
Out[45]: array([1, 1, 3, 2])
```
For large arrays, the method could be made more memory-friendly by doing the operation in batches.
0 讨论(0)
发布评论:

提交评论
- 加载中...