问题
I have a Pandas DataFrame, where columns X1, Y1 have point coordinates for the first group of coordinates and columns X2, Y2 have point coordinates for the second group of coordinates. Both groups are independent of each other. It is just happen to be they are in the same dataframe. Example:
X1,Y1,X2,Y2
41246.438,0.49,38791.673,0.49
41304.5,0.491,38921.557,0.491
41392.062,0.492,39037.135,0.492
41515.5,0.493,39199.972,0.493
41636.062,0.494,39346.561,0.494
41795.188,0.495,39477.63,0.495
42027.75,0.496,39576.275,0.496
42252.25,0.497,39732.102,0.497
42486.812,0.498,39833.753,0.498
42739.062,0.499,39949.13,0.499
43012.125,0.5,40135.42,0.5
43472.75,0.5,40292.017,0.5
43909.562,0.501,40479.452,0.501
44312.625,0.502,40725.329,0.502
44799.938,0.503,40950.05,0.503
45294.938,0.504,41214.136,0.504
45729.625,0.505,41514.213,0.505
45942.438,0.506,41943.208,0.506
46067.688,0.507,42296.643,0.507
46215,0.508,42653.477,0.508
46336.75,0.509,43138.834,0.509
46476.562,0.51,43557.815,0.51
46584.25,0.511,43966.564,0.511
46654.75,0.512,44166.996,0.512
46707.75,0.513,44310.557,0.513
46774.188,0.514,44410.069,0.514
46832.062,0.515,44518.045,0.515
46905.062,0.516,44608.646,0.516
46976.562,0.517,44678.073,0.517
47077.938,0.518,44727.393,0.518
47215.688,0.519,44786.498,0.519
47290.625,0.52,44845.867,0.52
47351.5,0.521,44915.072,0.521
For each point in columns X1, Y1 I need to find a point in column X2, Y2 such that the Euclidean distance between these two points is the shortest.
As an outcome I need to place that found point from columns X2, Y2 in the same row as the corresponding point in X1, Y1. Also I need to augment to the same row the computed shortest Euclidean distance in another column D. Then repeat this process for each point in columns X1, Y1.
One way to do this is to iterate rows in columns X1, Y1, and for each row find shortest Euclidean distance in columns X2, Y2. There are may be better ways to do it without writing for loops.
回答1:
Solution
Use Faiss.
pip install faiss
Instead of IndexFlatL2 you can use slightly faster IndexIVFFlat that allows you to approximate results.
import faiss
def get_closest(df: pd.DataFrame)->pd.DataFrame:
d = 2 # dimensionality
xb = np.float32(df[["X2","Y2"]].values)
xb = np.ascontiguousarray(xb)
xq = np.float32(df[["X1","Y1"]].values)
xq = np.ascontiguousarray(xq)
index = faiss.IndexFlatL2(d) # build the index
index.add(xb) # add vectors to the index
D, I = index.search(xq, 1) # actual search
res_df = df[["X1","Y1"]]
res_df[["X2","Y2"]] = df[["X2","Y2"]].iloc[I[:,0]].reset_index(drop = True)
res_df["distance"] = D[:,0]
return res_df
get_closest(df)
Performance
For 1e4 (x,y) pairs in both sets - running time:
371 ms ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 1e5 vectors
33.9 s ± 3.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
That should be similar to generating full distances matrix, using scipy, or NumPy, but it's much more efficient in terms of memory usage, and do not require a further search on this matrix.
Notice
- In the function above - for
res_df
I'm setting it to be a slice ofdf
that's not recommended since changes you are making inres_df
will affectdf
. That's made for lower memory usage, if you want to avoid unpredictable behaviour you can make a copy. - In case if you need more than 1 neighbour for every point - it's very easy to achieve with faiss with minimum modifications.
Alternatives
Using KDTree
import pandas as pd
from scipy.spatial import KDTree
def get_closest(df: pd.DataFrame)->pd.DataFrame:
tree = KDTree(df[["X1", "Y1"]].values)
dist, ind = tree.query(df[["X2", "Y2"]].values, k=1) # k desired number of neighbors
res_df = df[["X1","Y1"]]
res_df[["X2","Y2"]] = df[["X2","Y2"]].iloc[ind].reset_index(drop = True)
res_df["distance"] = dist
return res_df
get_closest(df)
For 1e4 (x,y) pairs in both sets - running time:
1.43 s ± 55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 1e5 (x,y) pairs in both sets - running time:
17 s ± 767 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using cdist
, proposed by @Dimon
df[['X2','Y2']] = \
df[['X2','Y2']].iloc[np.argmin(cdist(df[['X1','Y1']], df[['X2','Y2']],
metric='euclidean' ), axis=1),:].copy().reset_index(drop=True)
df['D'] = np.linalg.norm(df[['X1','Y1']].values - df[['X2','Y2']].values, axis=1)
For 1e4 (x,y) pairs in both sets - running time:
543 ms ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 1e5 (x,y) pairs in both sets - running time:
MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
Using numpy
, as proposed by @Valdi_Bo
diffs = df.iloc[:, 2:].values[np.newaxis, :, :]\
- df.iloc[:, :2].values[:, np.newaxis, :]
diffs2 = (diffs ** 2).sum(axis=2)
result = pd.Series(np.sqrt(diffs2.min(axis=0)), name='minDist')
diffs2.argmin(axis=0)
For 1e4 (x,y) pairs in both sets - running time:
1.6 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For 1e5 (x,y) pairs in both sets - running time:
MemoryError: Unable to allocate 149. GiB for an array with shape (100000, 100000, 2) and data type float64
回答2:
I will show you how to compute the result based solely on Numpy.
The first step is to compute differences along each coordinate, between each "X1 / Y1" point and each "X2 / Y2" point:
diffs = df.iloc[:, 2:].values[np.newaxis, :, :]\
- df.iloc[:, :2].values[:, np.newaxis, :]
Then compute squares of these differences and sum them (for each pair of points):
diffs2 = (diffs ** 2).sum(axis=2)
And the last step is to compute the result by:
- finding minimal square distance from each "X2 / Y2" point,
- compute the root from it (for each point),
- convert to a Series.
The code to do it is:
result = pd.Series(np.sqrt(diffs2.min(axis=0)), name='minDist')
Additionally, if you want to know which "X1 / Y1" point is the closest to the given "X2 / Y2" point, run:
diffs2.argmin(axis=0)
For your data it is:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,
6, 7, 9, 10, 11, 12, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14],
dtype=int64)
回答3:
Here is alternative solution based on cdist:
from io import StringIO
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
df = \
'''
X1,Y1,X2,Y2
41246.438,0.49,38791.673,0.49
41304.5,0.491,38921.557,0.491
41392.062,0.492,39037.135,0.492
41515.5,0.493,39199.972,0.493
41636.062,0.494,39346.561,0.494
41795.188,0.495,39477.63,0.495
42027.75,0.496,39576.275,0.496
42252.25,0.497,39732.102,0.497
42486.812,0.498,39833.753,0.498
42739.062,0.499,39949.13,0.499
43012.125,0.5,40135.42,0.5
43472.75,0.5,40292.017,0.5
43909.562,0.501,40479.452,0.501
44312.625,0.502,40725.329,0.502
44799.938,0.503,40950.05,0.503
45294.938,0.504,41214.136,0.504
45729.625,0.505,41514.213,0.505
45942.438,0.506,41943.208,0.506
46067.688,0.507,42296.643,0.507
46215,0.508,42653.477,0.508
46336.75,0.509,43138.834,0.509
46476.562,0.51,43557.815,0.51
46584.25,0.511,43966.564,0.511
46654.75,0.512,44166.996,0.512
46707.75,0.513,44310.557,0.513
46774.188,0.514,44410.069,0.514
46832.062,0.515,44518.045,0.515
46905.062,0.516,44608.646,0.516
46976.562,0.517,44678.073,0.517
47077.938,0.518,44727.393,0.518
47215.688,0.519,44786.498,0.519
47290.625,0.52,44845.867,0.52
47351.5,0.521,44915.072,0.521
'''
df = pd.read_csv(StringIO(df), sep=",")
print(df)
df[['X2','Y2']] = \
df[['X2','Y2']].iloc[np.argmin(cdist(df[['X1','Y1']], df[['X2','Y2']],
metric='euclidean' ), axis=1),:].copy().reset_index(drop=True)
df['D'] = np.linalg.norm(df[['X1','Y1']].values - df[['X2','Y2']].values, axis=1)
print(df)
来源:https://stackoverflow.com/questions/64014127/find-all-shortest-euclidean-distances-between-two-groups-of-point-coordinates