If I have two arrays:
import numpy as np

X = np.random.rand(10000, 2)
Y = np.random.rand(10000, 2)
How can I, for each point in X, find out which point in Y is closest to it? So that in the end I have an array showing:
x1_index    y_index_of_closest
1           7
2           54
3           3
...         ...
I want to do this using both columns of X together, i.e. comparing each point (row) of X against every point (row) of Y.
This question is pretty popular. Since similar questions keep getting closed and linked here, I think it's worth pointing out that even though the existing answers are quite fast for thousands of data points, they start to break down after that. My potato segfaults at 10k items in each array.
The potential problem with the other answers is the algorithmic complexity: they compare everything in X to everything in Y. To get around that, at least on average, we need a better strategy for ruling out some of the things in Y.
In one dimension this is easy -- just sort everything and start popping out nearest neighbors (see the sketch below). In two dimensions there are a variety of strategies, but KD-trees are reasonably popular and are already implemented in the scipy stack. On my machine, there's a crossover between the various methods around the point where each of X and Y has 6k things in them.
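For intuition, here is a minimal sketch of the one-dimensional sort-based approach (my illustration, not part of the original answer), using np.searchsorted for the binary search:

import numpy as np

def nearest_1d(x, y):
    # Sort y once (O(m log m)); each query is then a binary search
    # (O(log m)) instead of a brute-force scan over all of y.
    order = np.argsort(y)
    y_sorted = y[order]
    # Insertion point of each x into the sorted y.
    pos = np.searchsorted(y_sorted, x)
    # The nearest neighbor is one of the two sorted values flanking pos.
    left = np.clip(pos - 1, 0, len(y) - 1)
    right = np.clip(pos, 0, len(y) - 1)
    take_right = np.abs(y_sorted[right] - x) < np.abs(y_sorted[left] - x)
    # Map back to indices into the original (unsorted) y.
    return order[np.where(take_right, right, left)]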
Back in two dimensions, with scipy:

from scipy.spatial import KDTree

# Build the tree on Y and query it with X, so that for each point in X
# we get the index of its nearest neighbor in Y (as the question asks).
tree = KDTree(Y)
neighbor_dists, neighbor_indices = tree.query(X)
The extremely poor performance of scipy's KDTree implementation has been a sore spot of mine for a while, especially given how much has been built on top of it. There are probably data sets where it performs well, but I haven't seen one yet.
If you don't mind an extra dependency, you can get a 1000x speed boost just by switching your KDTree library. The package pykdtree is pip-installable, and I pretty much guarantee the conda packages work fine too. With this approach, my used, budget chromebook can process X and Y with 10 million points each in barely 30 seconds. That handily beats segfaulting at 10 thousand points ;)
from pykdtree.kdtree import KDTree

# Same orientation as before: build on Y, query with X.
tree = KDTree(Y)
neighbor_dists, neighbor_indices = tree.query(X)
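If you want to find the crossover point on your own hardware, a rough timing harness (a sketch of my own; numbers will vary by machine and point count) looks like:

import time
import numpy as np
from scipy.spatial import KDTree as ScipyKDTree
from pykdtree.kdtree import KDTree as PyKDTree

X = np.random.rand(10000, 2)
Y = np.random.rand(10000, 2)

for name, Tree in [('scipy', ScipyKDTree), ('pykdtree', PyKDTree)]:
    start = time.perf_counter()
    tree = Tree(Y)  # build on Y so each X point gets its nearest Y point
    neighbor_dists, neighbor_indices = tree.query(X)
    print(name, time.perf_counter() - start)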
This has to be the most asked numpy question (I've answered it myself twice in the last week), but since it can be phrased a million ways:
import numpy as np
from scipy.spatial.distance import cdist

def withScipy(X, Y):  # faster
    # (len(X), len(Y)) matrix of squared distances; argmin over axis 1
    # picks, for each point in X, the index of the closest point in Y
    return np.argmin(cdist(X, Y, 'sqeuclidean'), axis=1)

def withoutScipy(X, Y):  # slower, using broadcasting
    # broadcasting gives a (len(Y), len(X)) matrix, so argmin over axis 0
    return np.argmin(np.sum((X[None, :, :] - Y[:, None, :])**2, axis=-1), axis=0)
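A quick sanity check (my addition) that the two functions agree, and what the output means:

X = np.random.rand(100, 2)
Y = np.random.rand(100, 2)
assert np.array_equal(withScipy(X, Y), withoutScipy(X, Y))
print(withScipy(X, Y)[:3])  # for X[0], X[1], X[2]: index of the closest point in Y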
There's also a numpy-only method using einsum that's faster than my function (but not cdist), but I don't understand it well enough to explain it.
EDIT += 21 months:
Algorithmically, though, the best way to do this is with a KDTree.

from sklearn.neighbors import KDTree

# sklearn's implementation allows return_distance=False, which saves memory
y_tree = KDTree(Y)
y_index_of_closest = y_tree.query(X, k=1, return_distance=False)
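Note that sklearn's query returns a 2-D array of shape (n_queries, k) even for k=1, so if you want a flat index array like the other approaches produce, ravel it:

y_index_of_closest = y_tree.query(X, k=1, return_distance=False).ravel()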
@HansMusgrave has a pretty good speedup for KDTree above.
And for completion's sake, the np.einsum
answer, which I now understand:
np.argmin( # (X - Y) ** 2
np.einsum('ij, ij ->i', X, X)[:, None] + # = X ** 2 \
np.einsum('ij, ij ->i', Y, Y) - # + Y ** 2 \
2 * X.dot(Y.T), # - 2 * X * Y
axis = 1)
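The identity at work is ||x - y||**2 = ||x||**2 + ||y||**2 - 2*x.y, with each einsum computing the row-wise squared norms. A quick check of the resulting distance matrix against cdist (my addition):

from scipy.spatial.distance import cdist

sq_dists = (np.einsum('ij, ij ->i', X, X)[:, None]
            + np.einsum('ij, ij ->i', Y, Y)
            - 2 * X.dot(Y.T))
assert np.allclose(sq_dists, cdist(X, Y, 'sqeuclidean'))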
@Divakar explains this method well on the wiki page of his package eucl_dist.
Source: https://stackoverflow.com/questions/41102645/for-every-point-in-an-array-find-the-closest-point-to-it-in-a-second-array-and