Distance calculation between rows in Pandas Dataframe using a distance matrix

小鲜肉 2020-12-31 10:04

I have the following Pandas DataFrame:

import pandas as pd
sample = pd.DataFrame({'Sym1': ['a','a','a','d'],'Sym2':['a','c','b','


        
3 Answers
  • 2020-12-31 10:37

    For large data, I found a fast way to do this. Assume your data is already a NumPy array named a.

    from sklearn.metrics.pairwise import euclidean_distances
    dist = euclidean_distances(a, a)
    

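    The experiment below compares the time taken by the two approaches:

    import time
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.metrics.pairwise import euclidean_distances

    a = np.random.rand(1000, 1000)

    # Approach 1: SciPy pdist + squareform
    time1 = time.time()
    distances = pdist(a, metric='euclidean')
    dist_matrix = squareform(distances)
    time2 = time.time()
    time2 - time1  # 0.3639109134674072 (seconds)

    # Approach 2: scikit-learn euclidean_distances
    time1 = time.time()
    dist = euclidean_distances(a, a)
    time2 = time.time()
    time2 - time1  # 0.08735871315002441 (seconds)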
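As a sanity check, the two approaches should produce the same matrix up to floating-point error. The following is a minimal sketch, assuming SciPy and scikit-learn are installed. The array size and seed are arbitrary choices for illustration, and time.perf_counter is swapped in for time.time because it is better suited to measuring short intervals. No timings are quoted here, since they depend on the machine.

```python
import time

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
a = rng.random((200, 50))       # arbitrary illustrative size, not from the answer

# Approach 1: condensed distances from SciPy, expanded to a square matrix
start = time.perf_counter()
dist_scipy = squareform(pdist(a, metric="euclidean"))
scipy_seconds = time.perf_counter() - start

# Approach 2: scikit-learn's pairwise Euclidean distances
start = time.perf_counter()
dist_sklearn = euclidean_distances(a, a)
sklearn_seconds = time.perf_counter() - start

# Both matrices should agree within a small tolerance
same = np.allclose(dist_scipy, dist_sklearn, atol=1e-6)

print(f"pdist + squareform:  {scipy_seconds:.4f} s")
print(f"euclidean_distances: {sklearn_seconds:.4f} s")
print(f"results match: {same}")
```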
    
  • 2020-12-31 10:41

    This does twice as much work as needed, but it technically also works for non-symmetric distance matrices (whatever that is supposed to mean):

    pd.DataFrame({idx1: {idx2: sum(DistMatrix[x][y] for (x, y) in zip(row1, row2))
                         for (idx2, row2) in sample.iterrows()}
                  for (idx1, row1) in sample.iterrows()})
    

    You can make it more readable by writing it in pieces:

    # helper: distance between two rows, summed over aligned symbol pairs
    def dist(xs, ys):
        return sum(DistMatrix[x][y] for (x, y) in zip(xs, ys))

    # helper: distances from a given row to every row in sample
    def xdist(x):
        return {idx: dist(x, y) for (idx, y) in sample.iterrows()}

    # the pairwise distance matrix
    pd.DataFrame({idx: xdist(x) for (idx, x) in sample.iterrows()})
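If the distance matrix is symmetric with zeros on its diagonal, the duplicated work mentioned above can be avoided by computing each pair of rows once and mirroring the result. The sketch below rests on assumptions: the question's real DistMatrix and the full sample frame are not shown, so the DistMatrix values and the three-row sample here are made up purely for illustration, and row_distance is a hypothetical helper name.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical symmetric symbol-to-symbol distances (illustrative values only)
DistMatrix = {
    "a": {"a": 0, "b": 1, "c": 2, "d": 3},
    "b": {"a": 1, "b": 0, "c": 1, "d": 2},
    "c": {"a": 2, "b": 1, "c": 0, "d": 1},
    "d": {"a": 3, "b": 2, "c": 1, "d": 0},
}

# Hypothetical sample rows; each row is a tuple of symbols
sample = pd.DataFrame({"Sym1": ["a", "a", "d"], "Sym2": ["a", "c", "b"]})


def row_distance(xs, ys):
    """Sum the symbol distances over aligned positions of two rows."""
    return sum(DistMatrix[x][y] for x, y in zip(xs, ys))


rows = [tuple(r) for r in sample.itertuples(index=False)]
n = len(rows)

# Fill only the upper triangle and mirror it; the diagonal stays 0,
# which is valid only because DistMatrix[x][x] == 0 for every symbol.
dist = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    dist[i, j] = dist[j, i] = row_distance(rows[i], rows[j])

result = pd.DataFrame(dist, index=sample.index, columns=sample.index)
print(result)
```

For a non-symmetric DistMatrix this shortcut does not apply, and the full double loop from the answer is still needed.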
    
  • 2020-12-31 10:46

    This is an old question, but there is a SciPy function that does this:

    from scipy.spatial.distance import pdist, squareform
    
    distances = pdist(sample.values, metric='euclidean')
    dist_matrix = squareform(distances)
    

    pdist operates on NumPy arrays, and DataFrame.values is the underlying NumPy ndarray representation of the data frame. The metric argument lets you select one of several built-in distance metrics, or you can pass any function of two vectors to use a custom distance. It's very powerful and, in my experience, very fast. The result is a "flat" (condensed) array that contains only the upper triangle of the distance matrix (because it's symmetric), excluding the diagonal (because it's always 0). squareform then expands this condensed form into a full square matrix.
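To make the condensed layout and the custom-metric option concrete, here is a small sketch on three made-up points. The points array and the sum_abs_diff function are illustrative assumptions, not part of the question's data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three illustrative 2-D points (made up for this example)
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one entry per pair, ordered (0,1), (0,2), (1,2)
condensed = pdist(points, metric="euclidean")  # values 5, 10, 5

# Full symmetric matrix with zeros on the diagonal
full = squareform(condensed)


def sum_abs_diff(u, v):
    """Custom metric: sum of absolute coordinate differences."""
    return np.abs(u - v).sum()


# Any callable taking two 1-D arrays and returning a number can be a metric
custom = pdist(points, metric=sum_abs_diff)  # values 7, 14, 7

print(condensed)
print(full)
print(custom)
```

For n rows the condensed array has n * (n - 1) / 2 entries, which is 3 for the three points here.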

    The docs have more info, including a mathematical rundown of the many built-in distance functions.
