Question
I have a set of cars and their positions over time. I would like to get the distance between each car and its nearest neighbour, and then average these distances for each time point. An example dataframe is as follows:
time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
df
        x    y  car
time
0     216   13    1
0     218   12    2
0     217   12    3
1     280  110    1
1     290  109    3
2     130    3    4
2     132   56    5
For each time point, I would like to know the nearest car neighbour for each car. Example:
df2
      car  nearest_neighbour  euclidean_distance
time
0       1                  3                1.41
0       2                  3                1.00
0       3                  1                1.41
1       1                  3               10.05
1       3                  1               10.05
2       4                  5               53.04
2       5                  4               53.04
I know I can calculate the pairwise distances between cars from "How to apply euclidean distance function to a groupby object in pandas dataframe?", but how do I get the nearest neighbour for each car?
After that, it seems simple enough to get an average of the distances for each frame using groupby, but it's the second step that really throws me off. Help appreciated!
Answer 1:
It might be a bit overkill, but you could use NearestNeighbors from scikit-learn.
An example:
import numpy as np
from sklearn.neighbors import NearestNeighbors
import pandas as pd
def nn(x):
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto', metric='euclidean').fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices
time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
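# For each time group, this holds the distances to the two nearest points and
# their positions within the group. The first entry is normally the point itself
# (distance 0), so the second entry is the nearest neighbour.
# drop(columns=...) and to_numpy() replace the older drop('car', 1) and as_matrix().
nns = df.drop(columns='car').groupby('time').apply(lambda g: nn(g[['x', 'y']].to_numpy()))
groups = df.groupby('time')
nn_rows = []
# Iterate over (time, result) pairs so that time values need not be 0, 1, 2, ...
for t, nn_set in nns.items():
    group = groups.get_group(t)
    for j, tup in enumerate(zip(nn_set[0], nn_set[1])):
        nn_rows.append({'time': t,
                        'car': group.iloc[j]['car'],
                        'nearest_neighbour': group.iloc[tup[1][1]]['car'],
                        'euclidean_distance': tup[0][1]})
nn_df = pd.DataFrame(nn_rows).set_index('time')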
Result:
      car  euclidean_distance  nearest_neighbour
time
0       1            1.414214                  3
0       2            1.000000                  3
0       3            1.000000                  2
1       1           10.049876                  3
1       3           10.049876                  1
2       4           53.037722                  5
2       5           53.037722                  4
(Note that at time 0, car 3's nearest neighbour is car 2: sqrt((217-216)**2 + 1) is about 1.4142135623730951, while sqrt((218-217)**2 + 0) = 1.)
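The arithmetic in the note above can be checked directly. This is only a small sketch using the time 0 coordinates from the example data; the variable names are made up for illustration.

```python
import math

# Coordinates at time 0, taken from the example data:
# car 1 = (216, 13), car 2 = (218, 12), car 3 = (217, 12)
d_car3_car1 = math.hypot(217 - 216, 12 - 13)  # distance from car 3 to car 1
d_car3_car2 = math.hypot(217 - 218, 12 - 12)  # distance from car 3 to car 2

# Car 2 is closer to car 3 than car 1 is.
print(d_car3_car1, d_car3_car2)
```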
Answer 2:
Use cdist from scipy.spatial.distance to get a matrix of the distances from each car to every other car. Since each car's distance to itself is 0, the diagonal elements are all 0.
Example (for time == 0):
import numpy as np
from scipy.spatial.distance import cdist
X = df[df.time==0][['x','y']]
dist = cdist(X, X)
dist
array([[0. , 2.23606798, 1.41421356],
[2.23606798, 0. , 1. ],
[1.41421356, 1. , 0. ]])
Use np.argsort to get the indexes that would sort the distance-matrix. The first column is just the row number because the diagonal elements are 0.
idx = np.argsort(dist)
idx
array([[0, 2, 1],
[1, 2, 0],
[2, 1, 0]], dtype=int64)
Then pick out the closest distances and the corresponding cars using idx (column 0 is the row itself, column 1 is the nearest other car):
dist[idx[:,0], idx[:,1]]
array([1.41421356, 1. , 1. ])
df[df.time==0].car.values[idx[:,1]]
array([3, 3, 2], dtype=int64)
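The argsort approach relies on each row's zero diagonal entry sorting first. If two cars ever shared the same position, that tie might not resolve in favour of the diagonal. A minimal sketch of one alternative, assuming plain NumPy broadcasting in place of cdist and setting the diagonal to infinity before taking the minimum (the variable names here are illustrative, not from the answer):

```python
import numpy as np

# Positions and car ids at time == 0, copied from the example data.
xy = np.array([[216, 13], [218, 12], [217, 12]], dtype=float)
cars = np.array([1, 2, 3])

# Pairwise Euclidean distances via broadcasting (stands in for cdist here).
diff = xy[:, None, :] - xy[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Exclude each car's distance to itself before searching for the minimum.
np.fill_diagonal(dist, np.inf)
nearest_idx = dist.argmin(axis=1)
nearest_dist = dist[np.arange(len(xy)), nearest_idx]
nearest_car = cars[nearest_idx]

print(nearest_car, nearest_dist)
```

With this variant, a group containing a single car would get an infinite distance, so such groups may need separate handling.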
Combine the above logic into a function that returns the required dataframe:
def closest(df):
    X = df[['x', 'y']]
    dist = cdist(X, X)
    v = np.argsort(dist)
    return df.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                     nearest_neighbour=df.car.values[v[:, 1]])
Then use it with groupby, finally dropping the index because groupby-apply adds an additional index level:
df.groupby('time').apply(closest).reset_index(drop=True)
   time    x    y  car  euclidean_distance  nearest_neighbour
0     0  216   13    1            1.414214                  3
1     0  218   12    2            1.000000                  3
2     0  217   12    3            1.000000                  2
3     1  280  110    1           10.049876                  3
4     1  290  109    3           10.049876                  1
5     2  130    3    4           53.037722                  5
6     2  132   56    5           53.037722                  4
By the way, your sample output is wrong for time 0. My answer and Bacon's answer both show the correct result.
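The question also asks for the average nearest-neighbour distance at each time point. A minimal sketch of that last step, assuming a result frame with the time and euclidean_distance columns shown in the output above (the values below are copied from that printed output, so they are rounded to six decimals; the names result and mean_nn_distance are illustrative):

```python
import pandas as pd

# Values copied from the printed result table (rounded as displayed).
result = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'car': [1, 2, 3, 1, 3, 4, 5],
    'euclidean_distance': [1.414214, 1.000000, 1.000000,
                           10.049876, 10.049876, 53.037722, 53.037722],
})

# Mean nearest-neighbour distance for each time point.
mean_nn_distance = result.groupby('time')['euclidean_distance'].mean()
print(mean_nn_distance)
```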
Source: https://stackoverflow.com/questions/51305370/calculating-average-distance-of-nearest-neighbours-in-pandas-dataframe