Distance calculation between rows in Pandas Dataframe using a distance matrix

小鲜肉 2020-12-31 10:04

I have the following Pandas DataFrame:

In [31]:
import pandas as pd
sample = pd.DataFrame({'Sym1': ['a','a','a','d'],'Sym2':['a','c','b','

3 Answers
  •  孤城傲影
    2020-12-31 10:46

    This is an old question, but there is a Scipy function that does this:

    from scipy.spatial.distance import pdist, squareform
    
    distances = pdist(sample.values, metric='euclidean')
    dist_matrix = squareform(distances)
    

    pdist operates on NumPy arrays, and DataFrame.values is the underlying NumPy ndarray representation of the data frame. The metric argument lets you select one of several built-in distance metrics, or you can pass in any function of two 1-D arrays to use a custom distance. It's very powerful and, in my experience, very fast. The result is a "flat" (condensed) array that holds only the upper triangle of the distance matrix (because the matrix is symmetric), excluding the diagonal (because it is always 0). squareform then expands this flattened form into a full square matrix.
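
    As a rough illustration of the condensed and square forms, here is a minimal sketch. The question's own DataFrame is truncated above, so the frame `df`, its columns `x` and `y`, the row labels, and the `manhattan` helper are all hypothetical stand-ins, and the data is assumed to be numeric.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical numeric frame (not the question's data, which is truncated).
df = pd.DataFrame(
    {"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 1.0]},
    index=["r0", "r1", "r2"],
)

# Condensed form: n*(n-1)/2 values, ordered (r0,r1), (r0,r2), (r1,r2).
condensed = pdist(df.to_numpy(), metric="euclidean")

# Full symmetric matrix with a zero diagonal, relabeled with the row index.
dist_matrix = pd.DataFrame(
    squareform(condensed), index=df.index, columns=df.index
)

# Custom metric: a callable that takes two 1-D arrays and returns a number.
def manhattan(u, v):
    return np.abs(u - v).sum()

custom = pdist(df.to_numpy(), metric=manhattan)
```

    Wrapping the square matrix back into a DataFrame keeps the row labels, so a single distance can be looked up with something like `dist_matrix.loc["r0", "r1"]`.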

    The SciPy docs for pdist have more info, including a mathematical rundown of the many built-in distance functions.
