Most efficient way to construct similarity matrix

孤城傲影 2020-12-31 13:53

I'm using the following links to create a "Euclidean Similarity Matrix" (that I convert to a DataFrame). https://stats.stackexchange.com/questions/53068/euclidean-distan

5 answers
  • 2020-12-31 14:11

    The simplest way I can find to get the same result as the OP is to use distance_matrix, also from scipy.spatial. The whole thing can be done in one sort-of-long line.

    import numpy as np
    import pandas as pd
    from scipy.spatial import distance_matrix
    
    # Original code from OP, slightly reformatted
    DF_var = pd.DataFrame.from_dict({
        "s1":[1.2,3.4,10.2],
        "s2":[1.4,3.1,10.7],
        "s3":[2.1,3.7,11.3],
        "s4":[1.5,3.2,10.9]
    }).T
    DF_var.columns = ["g1","g2","g3"]
    
    # Whole similarity algorithm in one line
    df_euclid = pd.DataFrame(
        1 / (1 + distance_matrix(DF_var.T, DF_var.T)),
        columns=DF_var.columns, index=DF_var.columns
    )
    
    #           g1        g2        g3
    # g1  1.000000  0.215963  0.051408
    # g2  0.215963  1.000000  0.063021
    # g3  0.051408  0.063021  1.000000
    

    The code above should run as-is when pasted into any Python environment that has NumPy, pandas and SciPy installed.
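As a cross-check on the one-liner above, here is a minimal NumPy-only sketch of the same calculation using broadcasting. The array name `values` and the intermediate names are illustrative, not from the original answer; the numbers are the ones in `DF_var`.

```python
import numpy as np

# Same values as the answer's DF_var: rows s1..s4, columns g1..g3
values = np.array([
    [1.2, 3.4, 10.2],
    [1.4, 3.1, 10.7],
    [2.1, 3.7, 11.3],
    [1.5, 3.2, 10.9],
])

cols = values.T                             # shape (3, 4): one row per g column
diff = cols[:, None, :] - cols[None, :, :]  # shape (3, 3, 4): all pairwise differences
dist = np.sqrt((diff ** 2).sum(axis=-1))    # shape (3, 3): Euclidean distances
similarity = 1 / (1 + dist)                 # a distance of 0 maps to a similarity of 1

print(np.round(similarity, 6))
```

The broadcasted intermediate holds every pairwise difference at once, so this is only reasonable for small inputs; for larger ones the SciPy routines are the safer choice.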

  • 2020-12-31 14:20

    There are two useful functions in scipy.spatial.distance that you can use for this: pdist and squareform. pdist returns the pairwise distances between observations (rows) as a one-dimensional condensed array, and squareform converts that array into a square distance matrix.

    One catch is that pdist computes distances rather than similarities, so you need to pass your own similarity function. Judging by the commented output in your code, your DataFrame is also not in the orientation pdist expects (it compares rows), so I've undone the transpose you did in your code.

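    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import euclidean, pdist, squareform
    
    
    def similarity_func(u, v):
        # Map a Euclidean distance in [0, inf) to a similarity in (0, 1]
        return 1 / (1 + euclidean(u, v))
    
    # No transpose here: rows are g1..g3, which is what pdist compares
    DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]})
    DF_var.index = ["g1","g2","g3"]
    
    dists = pdist(DF_var, similarity_func)
    sim = squareform(dists)
    # squareform leaves zeros on the diagonal; self-similarity should be 1
    np.fill_diagonal(sim, 1.0)
    DF_euclid = pd.DataFrame(sim, columns=DF_var.index, index=DF_var.index)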
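A possible variation on the code above, offered as a sketch rather than as part of the original answer: passing a Python callable makes pdist call it once per pair, whereas the built-in `metric="euclidean"` keeps the pairwise loop in compiled code. Applying `1 / (1 + d)` to the full square matrix afterwards also turns the zero diagonal into ones. The name `DF_sim` is illustrative.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Same data as the answer: rows g1..g3, columns s1..s4
DF_var = pd.DataFrame(
    {"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
     "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]},
    index=["g1", "g2", "g3"],
)

# Built-in metric: distances between rows, expanded to a square matrix
dist = squareform(pdist(DF_var.to_numpy(), metric="euclidean"))

# Transform the whole matrix; the zero diagonal becomes 1
DF_sim = pd.DataFrame(1 / (1 + dist), index=DF_var.index, columns=DF_var.index)
print(DF_sim)
```

Whether the speed difference matters depends on the number of rows; for a 3x3 result either form is fine.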
    
  • 2020-12-31 14:31

    You want scipy.spatial.distance.pdist or sklearn.metrics.pairwise.pairwise_distances.

  • 2020-12-31 14:32

    This is what I did:

    import pandas as pd
    from scipy.spatial.distance import euclidean
    
    DF_var = pd.DataFrame.from_dict({"s1":[1.2,3.4,10.2],"s2":[1.4,3.1,10.7],"s3":[2.1,3.7,11.3],"s4":[1.5,3.2,10.9]}).T
    DF_var.columns = ["g1","g2","g3"]
    
    def m_euclid(v1, v2):
        return (1/(1 + euclidean(v1,v2)))
    
    dist_list = []
    for j1 in DF_var.columns:
        dist_list.append([m_euclid(DF_var[j1], DF_var[j2]) for j2 in DF_var.columns])
    
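    # Label rows and columns so the matrix reads g1..g3 on both axes
    dist_matrix = pd.DataFrame(dist_list, index=DF_var.columns, columns=DF_var.columns)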
    
  • 2020-12-31 14:37

    I think you can just apply pdist and squareform directly to your DataFrame. Note that this returns the distances between its rows (s1 to s4 here), not similarities:

    from scipy.spatial.distance import pdist,squareform
    
    In [6]: squareform(pdist(DF_var, metric='euclidean'))
    
    Out[6]:
    array([[ 0.        ,  0.6164414 ,  1.4525839 ,  0.78740079],
           [ 0.6164414 ,  0.        ,  1.1       ,  0.24494897],
           [ 1.4525839 ,  1.1       ,  0.        ,  0.87749644],
           [ 0.78740079,  0.24494897,  0.87749644,  0.        ]])
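To make the orientation explicit, here is a short sketch assuming `DF_var` is the question's transposed frame (rows s1..s4, columns g1..g3). pdist compares rows, so transposing the input switches between the 4x4 matrix shown above and the 3x3 matrix over g1..g3. The names `by_rows` and `by_cols` are illustrative.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Assumed to match the question's frame: rows s1..s4, columns g1..g3
DF_var = pd.DataFrame(
    {"s1": [1.2, 3.4, 10.2], "s2": [1.4, 3.1, 10.7],
     "s3": [2.1, 3.7, 11.3], "s4": [1.5, 3.2, 10.9]}
).T
DF_var.columns = ["g1", "g2", "g3"]

by_rows = squareform(pdist(DF_var, metric="euclidean"))    # shape (4, 4): s vs s
by_cols = squareform(pdist(DF_var.T, metric="euclidean"))  # shape (3, 3): g vs g

print(by_rows.shape, by_cols.shape)
```

Either result is still a distance matrix; converting it to the similarity the question asks for takes a further `1 / (1 + d)` step.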
    