Compare a list with the rows in pandas using Cosine similarity and get the rank

大憨熊 submitted on 2021-02-08 12:01:22

Question


I have a pandas DataFrame and a user input. I need to compare the user input with each row of the DataFrame and get a ranked list of the rows based on cosine similarity.

Department  Country Age Grade   Score
Math    India   Young   A   97
Math    India   Young   B   86
Math    India   Young   D   68
Science India   Young   A   92
Science India   Young   B   81
Science India   Young   C   76
Social  India   Young   B   88
Social  India   Young   D   62
Social  India   Young   C   72

User input:

Country Age Grade   Score
India   Young   B   84
India   Young   D   65
India   Young   A   98

I would prefer to treat every row of the DataFrame as a list and the user input as a list as well, e.g. User_list1 = ['India','Young','B','84'], compare it with each row of the DataFrame (as a list) using cosine similarity, and get the ranked output of Department.

In my case, the output would be the ranked list of departments, Out = ['Math','Science','Social'], ordered by the cosine similarity results.


Answer 1:


Taking both dataframes as given above:

df
   Department Country Age Grade Score
0   Math    India   Young   A   97
1   Math    India   Young   B   86
2   Math    India   Young   D   68
3   Science India   Young   A   92
4   Science India   Young   B   81
5   Science India   Young   C   76
6   Social  India   Young   B   88
7   Social  India   Young   D   62
8   Social  India   Young   C   72

input

Country Age Grade   Score
0   India   Young   B   84
1   India   Young   D   65
2   India   Young   A   98
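
To make the rest of the answer reproducible, the two tables can be rebuilt as DataFrames. This is a minimal sketch; the name `user_input` is an assumption (the answer calls this frame `input`, which shadows a Python builtin), and Score is assumed to be an integer column.

```python
import pandas as pd

# Reference table: three rows per department, as printed above.
df = pd.DataFrame({
    "Department": ["Math"] * 3 + ["Science"] * 3 + ["Social"] * 3,
    "Country": ["India"] * 9,
    "Age": ["Young"] * 9,
    "Grade": ["A", "B", "D", "A", "B", "C", "B", "D", "C"],
    "Score": [97, 86, 68, 92, 81, 76, 88, 62, 72],
})

# User rows to compare against the reference table.
user_input = pd.DataFrame({
    "Country": ["India"] * 3,
    "Age": ["Young"] * 3,
    "Grade": ["B", "D", "A"],
    "Score": [84, 65, 98],
})

print(df.shape, user_input.shape)  # (9, 5) (3, 4)
```

To follow the answer's code verbatim, assign the second frame to `input` instead.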

One possible solution is:

import numpy as np
from collections import OrderedDict
from sklearn import preprocessing

# One encoder instance, refit on each categorical column below.
le = preprocessing.LabelEncoder()

Convert the categorical features to numeric codes using the scikit-learn package:

df['Country'] = le.fit_transform(df['Country'])
df['Age'] = le.fit_transform(df['Age'])
df['Grade'] = le.fit_transform(df['Grade'])
df

Output:

Department Country Age Grade    Score
0       Math    0      0    0   97
1       Math    0      0    1   86
2       Math    0      0    3   68
3      Science  0      0    0   92
4      Science  0      0    1   81
5      Science  0      0    2   76
6      Social   0      0    1   88
7      Social   0      0    3   62
8      Social   0      0    2   72

input['Country'] = le.fit_transform(input['Country'])
input['Age'] = le.fit_transform(input['Age'])
input['Grade'] = le.fit_transform(input['Grade'])
input

Output:

 Country  Age   Grade  Score
0   0       0     1     84
1   0       0     2     65
2   0       0     0     98
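
Note that in the two outputs above, grade D is encoded as 3 in df but as 2 in input, because the encoder is refit on the input, which contains only A, B and D. The sketch below shows one way to keep the codes aligned by fitting on the reference column once and reusing that encoder. The names `user_input` and `grade_encoder` are hypothetical, and only the Grade column is reproduced here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-ins for the two frames; only the Grade column matters here.
df = pd.DataFrame({"Grade": ["A", "B", "D", "A", "B", "C", "B", "D", "C"]})
user_input = pd.DataFrame({"Grade": ["B", "D", "A"]})

# Refitting on the input alone learns the classes A, B, D -> 0, 1, 2.
refit_codes = LabelEncoder().fit_transform(user_input["Grade"])

# Fitting once on the reference column and reusing the encoder keeps D -> 3.
grade_encoder = LabelEncoder().fit(df["Grade"])
shared_codes = grade_encoder.transform(user_input["Grade"])

print(refit_codes.tolist())   # [1, 2, 0]
print(shared_codes.tolist())  # [1, 3, 0]
```

With a shared encoder, `transform` raises a ValueError if the input contains a label that was not present when the encoder was fitted.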

Define a cosine-similarity function:

def cosine_similarity(a, b):
    # Numerator: dot product of the two vectors.
    nom = np.sum(np.multiply(a, b))
    # Denominator: product of the two Euclidean norms.
    denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
    sim = nom / denom
    return sim
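
As a quick sanity check, the same formula can be written with a dot product and vector norms and tried on a few small vectors. This is an illustrative sketch rather than part of the original answer; the explicit conversion to float arrays is an added assumption.

```python
import numpy as np

def cosine_similarity(a, b):
    # Same formula as above: dot product divided by the product of the norms.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors score 1, orthogonal vectors score 0.
print(round(cosine_similarity([1, 2], [2, 4]), 6))  # 1.0
print(round(cosine_similarity([1, 0], [0, 5]), 6))  # 0.0

# Two encoded rows from the tables above: (Country, Age, Grade, Score).
close_pair = cosine_similarity([0, 0, 1, 84], [0, 0, 1, 86])
print(close_pair)
```

Because Score is much larger than the encoded Grade, every pair of rows in this dataset lands very close to 1, so the comparison rests on small differences.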

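# Unique department names, in order of first appearance.
dept = list(df['Department'].values)
dept = list(OrderedDict.fromkeys(dept).keys())
results = []
for i in range(len(input)):
    # Similarity of user row i to every row of df (Department column skipped).
    similarity = []
    for j in range(len(df)):
        a = input.iloc[i]
        b = df.iloc[j, 1:]
        c_sim = cosine_similarity(a, b)
        similarity.append(c_sim)

    # Best similarity per department; assumes exactly 3 consecutive rows each.
    max_similarity = []
    for k in range(0, len(df), 3):
        max_3 = max(similarity[k:k+3])
        max_similarity.append(max_3)

    # Department with the highest best-match similarity for this user row.
    max_idx = max_similarity.index(max(max_similarity))
    results.append(dept[max_idx])
results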

Output (the best-matching department for each of the three user rows):

['Math', 'Social', 'Math']
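
The question asks for a ranked list of departments rather than a single best match. One way to get that, sketched here under the assumption that a department's score is the best similarity among its rows, is to group the similarities by department and sort. The helper `rank_departments` and the `features` list are hypothetical names; the reference data is the encoded df shown earlier.

```python
import numpy as np
import pandas as pd

# Encoded reference rows as printed above (Country, Age, Grade, Score).
df = pd.DataFrame({
    "Department": ["Math"] * 3 + ["Science"] * 3 + ["Social"] * 3,
    "Country": [0] * 9,
    "Age": [0] * 9,
    "Grade": [0, 1, 3, 0, 1, 2, 1, 3, 2],
    "Score": [97, 86, 68, 92, 81, 76, 88, 62, 72],
})
features = ["Country", "Age", "Grade", "Score"]

def rank_departments(row, reference):
    """Return departments ordered by their best cosine similarity to `row`."""
    ref = reference[features].to_numpy(dtype=float)
    vec = np.asarray(row, dtype=float)
    # Cosine similarity of `vec` against every reference row at once.
    sims = ref @ vec / (np.linalg.norm(ref, axis=1) * np.linalg.norm(vec))
    # Keep the best score per department, then sort from most to least similar.
    best = pd.Series(sims, index=reference["Department"]).groupby(level=0).max()
    return best.sort_values(ascending=False).index.tolist()

# First user row, encoded as in the output above: India, Young, B, 84.
print(rank_departments([0, 0, 1, 84], df))  # ['Math', 'Science', 'Social']
```

Unlike the fixed step of 3 in the loop above, grouping by department name does not depend on every department having the same number of rows.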


Source: https://stackoverflow.com/questions/54033724/compare-a-list-with-the-rows-in-pandas-using-cosine-similarity-and-get-the-rank
