How to apply a Euclidean distance function to a groupby object in a pandas dataframe?

Question

I have a set of objects and their positions over time. I would like to get the average distance between objects for each time point. An example dataframe is as follows:

import pandas as pd

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
df

             x       y      car
     time
      0     216     13       1
      0     218     12       2
      0     217     12       3
      1     280     110      1
      1     290     109      3
      2     130     3        4
      2     132     56       5

The end result I would like to have is:

df2

              average distance
              between cars       
     time
      0           1.55     
      1           10.05     
      2           53.04

Any idea on how to proceed? I've been trying to apply a scipy.spatial.distance function to the dataframe, but I'm not sure how to apply it to df.groupby('time') and then get the mean of all those distances. Any help appreciated!

Answer 1:

You could pass an array of the points to scipy.spatial.distance.pdist, which calculates the pairwise distance for every pair of points Xi and Xj with i < j, so each pair is counted once. Then take the mean.

import numpy as np
from scipy import spatial

df.groupby('time').apply(lambda x: spatial.distance.pdist(np.array(list(zip(x.x, x.y)))).mean())

Outputs:

time
0     1.550094
1    10.049876
2    53.037722
dtype: float64
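
For reference, here is a self-contained sketch of the same pdist approach that ends in a dataframe shaped like the df2 asked for in the question. It assumes that selecting only the x and y columns before apply is acceptable (so time and car never reach pdist), and the column label is only illustrative.

```python
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'x': [216, 218, 217, 280, 290, 130, 132],
    'y': [13, 12, 12, 110, 109, 3, 56],
    'car': [1, 2, 3, 1, 3, 4, 5],
})

# Select only the coordinate columns, then average the pairwise distances per time.
avg = df.groupby('time')[['x', 'y']].apply(lambda g: pdist(g.to_numpy()).mean())

# Turn the resulting Series into a one-column dataframe (label is illustrative).
df2 = avg.rename('average distance between cars').to_frame()
print(df2)
```

A time point with a single car has no pairs, so pdist returns an empty array and the mean comes out as NaN (with a NumPy warning).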

Answer 2:

For me, using apply or a for loop does not make much difference:

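import numpy as np
import pandas as pd
from scipy import spatial

l1 = []
l2 = []

for t, g in df.groupby('time'):
    pts = g[['x', 'y']].values
    d = spatial.distance.cdist(pts, pts)
    # take the strict upper triangle: each pair once, no self-distances
    # (masking zeros instead would also drop cars at identical positions)
    l2.append(d[np.triu_indices(len(pts), k=1)].mean())
    l1.append(t)

pd.DataFrame({'ave': l2}, index=l1)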

Out[250]: 
         ave
0   1.550094
1  10.049876
2  53.037722

Answer 3:

Building this up from first principles:

For each point at index n, compute its distance to every point with index > n.

If the distance between two points is given by the formula:

np.sqrt((x0 - x1)**2 + (y0 - y1)**2)

then, for the points in a dataframe (here the whole df, ignoring time), we can compute all the distances and take their mean:

distances = []
for i in range(len(df)-1):
    distances += np.sqrt( (df.x[i+1:] - df.x[i])**2 + (df.y[i+1:] - df.y[i])**2 ).tolist()

np.mean(distances)

The same logic can be expressed with pd.concat and a couple of helper functions:

def diff_sq(x, i):
    return (x.iloc[i+1:] - x.iloc[i])**2

def dist_df(x, y, i):
    d_sq = diff_sq(x, i) + diff_sq(y, i)
    return np.sqrt(d_sq)

def avg_dist(df):
    return pd.concat([dist_df(df.x, df.y, i) for i in range(len(df)-1)]).mean()

The avg_dist function can then be used with groupby:

df.groupby('time').apply(avg_dist)
# outputs:
time
0     1.550094
1    10.049876
2    53.037722
dtype: float64
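
The rule above (each point against every point with a larger index) can also be sketched with NumPy broadcasting instead of an explicit loop. This is only an illustration: mean_pairwise_distance is a hypothetical helper name, and the sample points are the three cars at time 0 from the question.

```python
import numpy as np

def mean_pairwise_distance(points):
    """Mean Euclidean distance over all unordered pairs of 2-D points."""
    pts = np.asarray(points, dtype=float)       # shape (n, 2)
    diff = pts[:, None, :] - pts[None, :, :]    # shape (n, n, 2): every point minus every other
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # shape (n, n): full distance matrix
    i, j = np.triu_indices(len(pts), k=1)       # index pairs with i < j
    return dist[i, j].mean()

# The three cars at time 0: distances are sqrt(5), sqrt(2) and 1.
time0 = [(216, 13), (218, 12), (217, 12)]
print(mean_pairwise_distance(time0))
```

The printed value should agree with the 1.550094 shown for time 0. The helper could be applied per group with something like df.groupby('time')[['x', 'y']].apply(lambda g: mean_pairwise_distance(g.to_numpy())).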

Answer 4:

You could also use the itertools module to define your own function, as follows:

import itertools
import numpy as np
import pandas as pd

def combinations(series):
    # squared difference for every pair of values in the column
    l = list()
    for item in itertools.combinations(series, 2):
        l.append((item[0] - item[1])**2)
    return l

df2 = df.groupby('time').agg(combinations)
# column 0 holds the squared x differences, column 1 the squared y differences
df2['avg_distance'] = [np.mean(np.sqrt(pd.Series(df2.iloc[k, 0]) +
                                       pd.Series(df2.iloc[k, 1]))) for k in range(len(df2))]

df2.avg_distance.to_frame()

Then, the output is:

    avg_distance
time    
0   1.550094
1   10.049876
2   53.037722
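
A possible variant of the itertools idea pairs up whole (x, y) points rather than each column separately, so the distance per pair is computed in one step. This is a sketch under assumptions: avg_pair_distance is a hypothetical helper name, and only the x and y columns are passed to apply.

```python
import itertools
import math

import pandas as pd

df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'x': [216, 218, 217, 280, 290, 130, 132],
    'y': [13, 12, 12, 110, 109, 3, 56],
    'car': [1, 2, 3, 1, 3, 4, 5],
})

def avg_pair_distance(group):
    """Average Euclidean distance over all pairs of rows in one group."""
    points = list(zip(group['x'], group['y']))
    dists = [math.hypot(p[0] - q[0], p[1] - q[1])
             for p, q in itertools.combinations(points, 2)]
    # A group with a single car has no pairs; return NaN in that case.
    return sum(dists) / len(dists) if dists else float('nan')

result = df.groupby('time')[['x', 'y']].apply(avg_pair_distance)
print(result.rename('avg_distance').to_frame())
```

This avoids storing lists inside dataframe cells, at the cost of a Python-level loop over pairs in each group.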

Source: https://stackoverflow.com/questions/51064346/how-to-apply-euclidean-distance-function-to-a-groupby-object-in-pandas-dataframe

Tags

python

pandas

dataframe

euclidean-distance