How to apply euclidean distance function to a groupby object in pandas dataframe?

Submitted by 做~自己de王妃 on 2019-12-24 14:03:40

Question


I have a set of objects and their positions over time. I would like to get the average distance between objects for each time point. An example dataframe is as follows:

import pandas as pd

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
df

   time    x    y  car
0     0  216   13    1
1     0  218   12    2
2     0  217   12    3
3     1  280  110    1
4     1  290  109    3
5     2  130    3    4
6     2  132   56    5

The end result I would like to have is:

df2

              average distance
              between cars       
     time
      0           1.55     
      1           10.05     
      2           53.04    

Any idea how to proceed? I've been trying to apply the functions in scipy.spatial.distance to the dataframe, but I'm not sure how to apply them to df.groupby('time') and then take the mean of all those distances. Any help appreciated!


Answer 1:


You could pass an array of the points to scipy.spatial.distance.pdist, which calculates all pairwise distances between points Xi and Xj for i < j, so each pair is counted once. Then take the mean.

import numpy as np
from scipy import spatial

# g is the sub-dataframe for one time point; pdist returns one distance per pair of rows
df.groupby('time').apply(lambda g: spatial.distance.pdist(g[['x', 'y']].to_numpy()).mean())

Outputs:

time
0     1.550094
1    10.049876
2    53.037722
dtype: float64
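
A minimal, self-contained sketch of the same idea, written as a named function and an explicit loop over the groups instead of a lambda. The helper name mean_pairwise_distance and the choice to return NaN for a time point with a single car are assumptions made for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'x': [216, 218, 217, 280, 290, 130, 132],
    'y': [13, 12, 12, 110, 109, 3, 56],
    'car': [1, 2, 3, 1, 3, 4, 5],
})

def mean_pairwise_distance(group):
    """Mean Euclidean distance over all unordered pairs of rows in group."""
    points = group[['x', 'y']].to_numpy()
    if len(points) < 2:
        # Assumption: a lone car has no pairs, so its average is undefined
        return np.nan
    return pdist(points).mean()

# One entry per time point, keyed by the group label
result = pd.Series(
    {t: mean_pairwise_distance(g) for t, g in df.groupby('time')},
    name='avg_distance',
)
result.index.name = 'time'
print(result)
```

Iterating over df.groupby('time') directly keeps the per-group logic in plain Python, which can be easier to debug than a lambda passed to apply.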



Answer 2:


In my experience, using apply or a for loop makes little difference here.

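import numpy as np
import pandas as pd
from scipy import spatial

l1 = []  # group labels (time points)
l2 = []  # mean pairwise distance per group

for t, g in df.groupby('time'):
    pts = g[['x', 'y']].values
    v = spatial.distance.cdist(pts, pts)  # full symmetric distance matrix

    # keep only the strict upper triangle: each pair once, no self-distances
    # (masking zeros instead would also drop cars at identical positions)
    iu = np.triu_indices(len(pts), k=1)
    l2.append(v[iu].mean())
    l1.append(t)


pd.DataFrame({'ave': l2}, index=l1)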

Out[250]: 
         ave
0   1.550094
1  10.049876
2  53.037722



Answer 3:


Building this up from first principles:

For each point at index n, it is necessary to compute the distance with all the points with index > n.

If the distance between two points is given by the formula:

np.sqrt((x0 - x1)**2 + (y0 - y1)**2)

then for an array of points in a dataframe, we can collect all the pairwise distances and calculate their mean:

distances = []
for i in range(len(df)-1):
    distances += np.sqrt( (df.x[i+1:] - df.x[i])**2 + (df.y[i+1:] - df.y[i])**2 ).tolist()

np.mean(distances)
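
As a sanity check on the formula, the three cars at time 0 can be worked through by hand. The sketch below hard-codes their coordinates from the example dataframe; the helper name euclid is introduced here purely for illustration.

```python
import math

# The three cars at time 0 in the example dataframe, as (x, y)
p1, p2, p3 = (216, 13), (218, 12), (217, 12)

def euclid(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

d12 = euclid(p1, p2)  # sqrt(2**2 + 1**2) = sqrt(5)
d13 = euclid(p1, p3)  # sqrt(1**2 + 1**2) = sqrt(2)
d23 = euclid(p2, p3)  # sqrt(1**2 + 0**2) = 1

mean_t0 = (d12 + d13 + d23) / 3
print(round(mean_t0, 6))  # 1.550094
```

Three points give three unordered pairs, which is why the sum is divided by 3 rather than by the number of points squared.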

The same logic can be expressed using pd.concat and a couple of helper functions:

def diff_sq(x, i):
    return (x.iloc[i+1:] - x.iloc[i])**2

def dist_df(x, y, i):
    d_sq = diff_sq(x, i) + diff_sq(y, i)
    return np.sqrt(d_sq)

def avg_dist(df):
    return pd.concat([dist_df(df.x, df.y, i) for i in range(len(df)-1)]).mean()

The avg_dist function can then be used with groupby:

df.groupby('time').apply(avg_dist)
# outputs:
time
0     1.550094
1    10.049876
2    53.037722
dtype: float64
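
The "index > n" pairing can also be written without a Python-level loop by using NumPy broadcasting. This is a sketch under stated assumptions rather than part of the original answer: the function name avg_dist_broadcast is hypothetical, and returning NaN for a group with fewer than two points is a choice made here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'x': [216, 218, 217, 280, 290, 130, 132],
    'y': [13, 12, 12, 110, 109, 3, 56],
    'car': [1, 2, 3, 1, 3, 4, 5],
})

def avg_dist_broadcast(group):
    """Mean distance over all pairs (i, j) with j > i, via broadcasting."""
    pts = group[['x', 'y']].to_numpy(dtype=float)   # shape (n, 2)
    diff = pts[:, None, :] - pts[None, :, :]        # shape (n, n, 2)
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # shape (n, n)
    i, j = np.triu_indices(len(pts), k=1)           # index pairs with j > i
    # Assumption: fewer than two points means no pairs, so return NaN
    return dist[i, j].mean() if len(i) else np.nan

# Select the coordinate columns before applying, so only x and y are passed in
avg = df.groupby('time')[['x', 'y']].apply(avg_dist_broadcast)
print(avg)
```

The full n-by-n matrix is built for each group, so memory grows quadratically with the number of objects per time point; for small groups like these that cost is negligible.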



Answer 4:


You could also use the itertools package to define your own function as follows:

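import itertools
import numpy as np
import pandas as pd

def combinations(series):
    # squared difference for every unordered pair of values in the series
    l = list()
    for item in itertools.combinations(series, 2):
        l.append((item[0] - item[1])**2)
    return l

df2 = df.groupby('time').agg(combinations)
# column 0 holds the squared x differences, column 1 the squared y differences
df2['avg_distance'] = [np.mean(np.sqrt(pd.Series(df2.iloc[k, 0]) +
                                       pd.Series(df2.iloc[k, 1]))) for k in range(len(df2))]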

df2.avg_distance.to_frame()

Then, the output is:

    avg_distance
time    
0   1.550094
1   10.049876
2   53.037722
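
A variation on the itertools idea, sketched here as an assumption-labeled alternative: pair up whole (x, y) points rather than aggregating each column separately, and let math.dist (Python 3.8+) compute each distance. The function name avg_pair_distance is hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean

import pandas as pd

df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'x': [216, 218, 217, 280, 290, 130, 132],
    'y': [13, 12, 12, 110, 109, 3, 56],
    'car': [1, 2, 3, 1, 3, 4, 5],
})

def avg_pair_distance(group):
    """Mean Euclidean distance over every unordered pair of points in group."""
    points = list(zip(group['x'], group['y']))
    pairs = list(combinations(points, 2))
    if not pairs:
        # Assumption: a group with a single car yields NaN
        return float('nan')
    return mean(dist(p, q) for p, q in pairs)

avg_distance = df.groupby('time')[['x', 'y']].apply(avg_pair_distance)
print(avg_distance.to_frame('avg_distance'))
```

Working on whole points avoids storing lists inside dataframe cells, at the cost of a pure-Python loop over the pairs.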


Source: https://stackoverflow.com/questions/51064346/how-to-apply-euclidean-distance-function-to-a-groupby-object-in-pandas-dataframe