How to apply a Euclidean distance function to a groupby object in a pandas DataFrame?

佛祖请我去吃肉 2021-01-20 06:03

I have a set of objects and their positions over time. I would like to get the average distance between objects for each time point. An example dataframe is as follows:

4 answers
  • 2021-01-20 06:35

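    You could also use the itertools package to define your own function, as follows:

    import itertools
    import numpy as np
    import pandas as pd

    def combinations(series):
        # Squared difference for every unordered pair of values in one column
        return [(a - b) ** 2 for a, b in itertools.combinations(series, 2)]

    # One list of squared differences per column and per time point
    df2 = df.groupby('time').agg(combinations)
    # Assumes x and y are the first two columns after grouping:
    # add their squared differences pairwise, take the square root, then average
    df2['avg_distance'] = [
        np.mean(np.sqrt(pd.Series(df2.iloc[k, 0]) + pd.Series(df2.iloc[k, 1])))
        for k in range(len(df2))
    ]

    df2.avg_distance.to_frame()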
    

    Then, the output is:

        avg_distance
    time    
    0   1.550094
    1   10.049876
    2   53.037722
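
The same itertools idea can also be applied to whole (x, y) points rather than to each column separately. The following is only a sketch: the data is made up for illustration (it is not the question's dataframe), `mean_pairwise_distance` is a hypothetical helper name, and `math.dist` needs Python 3.8 or later.

```python
import itertools
import math

import pandas as pd

# Hypothetical data: three objects at two time points (not the question's data).
df = pd.DataFrame({
    "time": [0, 0, 0, 1, 1, 1],
    "x": [0.0, 3.0, 3.0, 0.0, 6.0, 6.0],
    "y": [0.0, 4.0, 0.0, 0.0, 8.0, 0.0],
})

def mean_pairwise_distance(group):
    """Average Euclidean distance over all unordered pairs of points in a group."""
    points = list(zip(group["x"], group["y"]))
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    # A group with fewer than two points has no pairs, so return NaN
    return sum(dists) / len(dists) if dists else float("nan")

# Select only the coordinate columns so the helper sees just x and y
result = df.groupby("time")[["x", "y"]].apply(mean_pairwise_distance)
```

With this made-up data each group is a 3-4-5 style right triangle, so the averages are easy to check by hand: (3 + 4 + 5) / 3 for time 0 and (6 + 8 + 10) / 3 for time 1.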
    
  • 2021-01-20 06:46

    Building this up from first principles:

    For each point at index n, we need to compute the distance to every point with index > n.

    If the distance between two points is given by the formula:

    np.sqrt((x0 - x1)**2 + (y0 - y1)**2)
    

    then, for the points in a dataframe, we can collect all the pairwise distances and take their mean:

    distances = []
    for i in range(len(df)-1):
        distances += np.sqrt( (df.x[i+1:] - df.x[i])**2 + (df.y[i+1:] - df.y[i])**2 ).tolist()
    
    np.mean(distances)
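
    In formula terms, the loop above averages the distance over every unordered pair of points. The symbols are notation introduced here for illustration: N is the number of points in the dataframe and (x_i, y_i) are the coordinates of point i, so there are N(N-1)/2 pairs.

    ```latex
    \bar{d} = \frac{2}{N(N-1)} \sum_{i<j} \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
    ```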
    

    Expressing the same logic using pd.concat and a couple of helper functions:

    def diff_sq(x, i):
        return (x.iloc[i+1:] - x.iloc[i])**2
    
    def dist_df(x, y, i):
        d_sq = diff_sq(x, i) + diff_sq(y, i)
        return np.sqrt(d_sq)
    
    def avg_dist(df):
        return pd.concat([dist_df(df.x, df.y, i) for i in range(len(df)-1)]).mean()
    

    Then the avg_dist function can be used with groupby:

    df.groupby('time').apply(avg_dist)
    # outputs:
    time
    0     1.550094
    1    10.049876
    2    53.037722
    dtype: float64
    
  • 2021-01-20 06:51

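    For me, using apply or a for loop does not make much difference:

    import numpy as np
    import pandas as pd
    from scipy import spatial

    l1 = []  # group keys (time points)
    l2 = []  # mean pairwise distance per group

    for y, x in df.groupby('time'):
        # Full distance matrix for the group, keeping only the upper triangle
        v = np.triu(spatial.distance.cdist(x[['x','y']].values, x[['x','y']].values), k=0)

        # Mask the zeros (diagonal and lower triangle) so they do not enter the mean;
        # note that this also masks a true zero distance between coincident points
        v = np.ma.masked_equal(v, 0)
        l2.append(np.mean(v))
        l1.append(y)

    pd.DataFrame({'ave': l2}, index=l1)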
    
    Out[250]: 
             ave
    0   1.550094
    1  10.049876
    2  53.037722
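
If two objects can sit at exactly the same position, masking zeros drops that pair from the average. One possible variation, sketched below, picks the pairs by position with `np.triu_indices` instead of by value. The data is made up for illustration (it is not the question's dataframe) and deliberately contains two coincident points at time 1.

```python
import numpy as np
import pandas as pd
from scipy import spatial

# Hypothetical data; at time 1 the first two objects share the same position.
df = pd.DataFrame({
    "time": [0, 0, 0, 1, 1, 1],
    "x": [0.0, 3.0, 3.0, 0.0, 0.0, 6.0],
    "y": [0.0, 4.0, 0.0, 0.0, 0.0, 8.0],
})

rows = {}
for t, g in df.groupby("time"):
    pts = g[["x", "y"]].to_numpy()
    full = spatial.distance.cdist(pts, pts)  # symmetric n x n distance matrix
    i, j = np.triu_indices(len(pts), k=1)    # index pairs with i < j, no diagonal
    rows[t] = full[i, j].mean()              # zero distances stay in the mean

ave = pd.Series(rows, name="ave")
```

For time 1 the three pairwise distances are 0, 10 and 10, so this gives 20/3, whereas masking zeros would average only the two nonzero distances and give 10.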
    
  • 2021-01-20 06:58

    You could pass an array of the points to scipy.spatial.distance.pdist, which calculates all pairwise distances between Xi and Xj for i < j. Then take the mean.

    import numpy as np
    from scipy import spatial
    
    df.groupby('time').apply(lambda x: spatial.distance.pdist(np.array(list(zip(x.x, x.y)))).mean())
    

    Outputs:

    time
    0     1.550094
    1    10.049876
    2    53.037722
    dtype: float64
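
For reuse, the lambda can be replaced by a small named function. This is a sketch under stated assumptions: `avg_pdist` is a hypothetical name, the data is made up for illustration, and returning NaN for a time point with fewer than two objects is one possible choice, not something the question specifies.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

def avg_pdist(group):
    """Mean pairwise Euclidean distance between the (x, y) rows of a group."""
    pts = group[["x", "y"]].to_numpy()
    if len(pts) < 2:
        # No pairs exist; avoid taking the mean of an empty array
        return np.nan
    return pdist(pts).mean()

# Hypothetical data: three objects at time 0, two at time 1, one at time 2.
df = pd.DataFrame({
    "time": [0, 0, 0, 1, 1, 2],
    "x": [0.0, 3.0, 3.0, 0.0, 6.0, 1.0],
    "y": [0.0, 4.0, 0.0, 0.0, 8.0, 1.0],
})

out = df.groupby("time")[["x", "y"]].apply(avg_pdist)
```

Here time 0 averages the sides of a 3-4-5 triangle, time 1 has a single pair at distance 10, and time 2 has no pairs at all.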
    