“Correlation matrix” for strings. Similarity of nominal data

2021-01-24 07:05

Here is my data frame df:

  store_1      store_2         store_3         store_4
0 banana      banana           plum            banana
1 orange      ta

2 Answers
  • 2021-01-24 07:45

    You can try something like this:

    import itertools as it
    import numpy as np
    import pandas as pd

    # overlap of two columns: share of values in `a` that also occur in `b`
    corr = lambda a, b: len(set(a).intersection(set(b))) / len(a)

    # pairwise overlaps for every pair of columns, upper triangle only
    c = [corr(*x) for x in it.combinations_with_replacement(df.T.values.tolist(), 2)]

    # unpack the flat list of overlaps into an upper-triangular matrix
    j = 0
    x = []
    for i in range(4, 0, -1):  # replace 4 with df.shape[-1]
        x.append([np.nan] * (4 - i) + c[j:j + i])
        j += i
    pd.DataFrame(x, columns=df.columns, index=df.columns)
    

    which yields:

            store_1 store_2 store_3 store_4
    store_1 1.0     0.4     0.4     0.8
    store_2 NaN     1.0     0.2     0.4
    store_3 NaN     NaN     1.0     0.2
    store_4 NaN     NaN     NaN     1.0
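
    For reference, here is a minimal sketch of the same overlap idea written as an explicit double loop, which fills both triangles of the matrix instead of only the upper one (it assumes df is the data frame from the question):

    import pandas as pd

    def overlap(a, b):
        # share of values in column `a` that also occur anywhere in column `b`
        return len(set(a) & set(b)) / len(a)

    cols = df.columns
    sim = pd.DataFrame(
        [[overlap(df[c1], df[c2]) for c2 in cols] for c1 in cols],
        index=cols,
        columns=cols,
    )

    Because all columns have the same length, overlap(a, b) equals overlap(b, a) here, so the resulting matrix is symmetric.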
    
  • 2021-01-24 07:51

    If you wish to estimate the similarity of the stores with regard to their products, you could use one-hot encoding.

    Each store can then be described by a vector of length n = the number of distinct products across all stores, such as:

    banana orange apple pear plum tangerine raspberry tomato melon ...

    Store_1 is then described as 1 1 1 1 1 0 0 0 0 0 ... and Store_2 as 1 0 0 1 0 1 1 1 0 ...

    This leaves you with numerical vectors, on which you can compute a dissimilarity measure such as the Euclidean distance.
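
    A minimal sketch of this approach with pandas and SciPy; the product lists below are hypothetical stand-ins that match the 0/1 vectors above, not the question's (truncated) data:

    import pandas as pd
    from scipy.spatial.distance import pdist, squareform

    # hypothetical product assortments per store (illustrative only)
    stores = {
        "store_1": ["banana", "orange", "apple", "pear", "plum"],
        "store_2": ["banana", "pear", "tangerine", "raspberry", "tomato"],
    }

    # one-hot encode: one row per store, one column per product,
    # 1 if the store carries that product
    long = pd.Series(stores).explode()
    onehot = pd.crosstab(long.index, long).clip(upper=1)

    # pairwise Euclidean distance between the binary store vectors
    dist = pd.DataFrame(
        squareform(pdist(onehot.to_numpy(), metric="euclidean")),
        index=onehot.index,
        columns=onehot.index,
    )
    print(dist)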
