Calculating average distance of nearest neighbours in pandas dataframe

Question

I have a set of objects (cars) and their positions over time. I would like to get the distance between each car and its nearest neighbour, and calculate the average of these distances for each time point. An example dataframe is as follows:

 import pandas as pd

 time = [0, 0, 0, 1, 1, 2, 2]
 x = [216, 218, 217, 280, 290, 130, 132]
 y = [13, 12, 12, 110, 109, 3, 56]
 car = [1, 2, 3, 1, 3, 4, 5]
 df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
 df

         x       y      car
 time
  0     216     13       1
  0     218     12       2
  0     217     12       3
  1     280     110      1
  1     290     109      3
  2     130     3        4
  2     132     56       5

For each time point, I would like to know the nearest car neighbour for each car. Example:

df2

          car    nearest_neighbour    euclidean_distance  
 time
  0       1            3                    1.41
  0       2            3                    1.00
  0       3            1                    1.41
  1       1            3                    10.05
  1       3            1                    10.05
  2       4            5                    53.04
  2       5            4                    53.04

I know I can calculate the pairwise distances between cars using the approach from "How to apply euclidean distance function to a groupby object in pandas dataframe?", but how do I get the nearest neighbour for each car?

After that, it seems simple enough to get an average of the distances for each frame using groupby, but it's the second step that really throws me off. Help appreciated!

Answer 1:

It might be a bit overkill, but you could use NearestNeighbors from scikit-learn.

An example:

import numpy as np 
from sklearn.neighbors import NearestNeighbors
import pandas as pd

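def nn(x):
    # n_neighbors=2 because, when querying the fitted points themselves, each
    # point's closest match is itself; column 1 holds the actual nearest neighbour
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto', metric='euclidean').fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices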

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56] 
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})

nn_rows = []
for t, group in df.groupby('time'):
    # For each car in this time step: distances to, and positions within the
    # group of, its two closest points (column 0 is the car itself)
    distances, indices = nn(group[['x', 'y']].to_numpy())
    cars = group['car'].to_numpy()
    for j in range(len(group)):
        nn_rows.append({'time': t,
                        'car': cars[j],
                        'euclidean_distance': distances[j, 1],
                        'nearest_neighbour': cars[indices[j, 1]]})

nn_df = pd.DataFrame(nn_rows).set_index('time')

Result:

      car  euclidean_distance  nearest_neighbour
time                                            
0       1            1.414214                  3
0       2            1.000000                  3
0       3            1.000000                  2
1       1           10.049876                  3
1       3           10.049876                  1
2       4           53.037722                  5
2       5           53.037722                  4

(Note that at time 0, car 3's nearest neighbor is car 2: its distance to car 1 is sqrt((217-216)**2 + (12-13)**2), about 1.414, while its distance to car 2 is sqrt((218-217)**2 + (12-12)**2) = 1.)
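The remaining step from the question, averaging the nearest-neighbour distance for each time point, could then be a groupby on the index. The following is a minimal sketch: nn_df is rebuilt by hand from the result table above so that the block runs on its own, and the name mean_nn_distance is made up for illustration.

```python
import pandas as pd

# Values copied from the result table above, so the sketch is self-contained.
nn_df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'car': [1, 2, 3, 1, 3, 4, 5],
    'euclidean_distance': [1.414214, 1.000000, 1.000000,
                           10.049876, 10.049876, 53.037722, 53.037722],
    'nearest_neighbour': [3, 3, 2, 3, 1, 5, 4],
}).set_index('time')

# 'time' is the index, so group on the index level rather than on a column.
mean_nn_distance = nn_df.groupby(level='time')['euclidean_distance'].mean()
print(mean_nn_distance)
```

If time were kept as an ordinary column instead of the index, groupby('time') would be used in place of groupby(level='time').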

Answer 2:

Use cdist from scipy.spatial.distance to get a matrix of the distances from each car to every other car. Since each car's distance to itself is 0, the diagonal elements are all 0.

Example (for time == 0):

import numpy as np
from scipy.spatial.distance import cdist

X = df[df.time==0][['x','y']]
dist = cdist(X, X)
dist
array([[0.        , 2.23606798, 1.41421356],
       [2.23606798, 0.        , 1.        ],
       [1.41421356, 1.        , 0.        ]])

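Use np.argsort to get, for each row, the column indexes that would sort the distances in ascending order. The first column is just the row number, because each car's distance to itself (0) is the smallest in its row (assuming no two cars share a position); the second column is the position of its nearest neighbour.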

idx = np.argsort(dist)
idx
array([[0, 2, 1],
       [1, 2, 0],
       [2, 1, 0]], dtype=int64)

Then pick out the closest distances and the corresponding cars using idx:

dist[idx[:,0], idx[:,1]]
array([1.41421356, 1.        , 1.        ])

df[df.time==0].car.values[idx[:,1]]
array([3, 3, 2], dtype=int64)

Combine the above logic into a function that returns the required dataframe:

 def closest(df):
     X = df[['x', 'y']]
     dist = cdist(X, X)
     v = np.argsort(dist)
     return df.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                      nearest_neighbour=df.car.values[v[:, 1]])

Then use it with groupby, finally dropping the index, because groupby-apply adds an additional index level:

df.groupby('time').apply(closest).reset_index(drop=True)

   time    x    y  car  euclidean_distance  nearest_neighbour
0     0  216   13    1            1.414214                  3
1     0  218   12    2            1.000000                  3
2     0  217   12    3            1.000000                  2
3     1  280  110    1           10.049876                  3
4     1  290  109    3           10.049876                  1
5     2  130    3    4           53.037722                  5
6     2  132   56    5           53.037722                  4

By the way, your sample output is wrong for time 0. My answer and Bacon's answer both show the correct result.
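The argsort approach relies on each car being the only zero-distance entry in its own row. As a hedged alternative, the sketch below masks the diagonal with np.inf and takes argmin, so two cars at the same position would not be confused with "self". The helper name closest_masked is hypothetical, and the distance matrix is built with NumPy broadcasting in place of cdist only to keep the block free of extra dependencies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'x': [216, 218, 217, 280, 290, 130, 132],
    'y': [13, 12, 12, 110, 109, 3, 56],
    'car': [1, 2, 3, 1, 3, 4, 5],
})

def closest_masked(group):
    """Nearest neighbour of each car within one time step (hypothetical helper)."""
    xy = group[['x', 'y']].to_numpy(dtype=float)
    # Pairwise Euclidean distances via broadcasting (stands in for cdist).
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    # Rule out each car's distance to itself so argmin never picks the diagonal.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    cars = group['car'].to_numpy()
    return pd.DataFrame({
        'time': group['time'].to_numpy(),
        'car': cars,
        'nearest_neighbour': cars[nearest],
        'euclidean_distance': dist[np.arange(len(group)), nearest],
    })

# Iterating over the groups keeps the 'time' column available in every group.
result = pd.concat(
    [closest_masked(g) for _, g in df.groupby('time')],
    ignore_index=True,
)
print(result)
```

A time step containing a single car would come out with an infinite distance and itself as neighbour, so such rows would need to be filtered out before averaging.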

Source: https://stackoverflow.com/questions/51305370/calculating-average-distance-of-nearest-neighbours-in-pandas-dataframe

Tags

python

pandas

dataframe

euclidean-distance