Compare a list with the rows in pandas using Cosine similarity and get the rank

Question

I have a pandas DataFrame and a user input. I need to compare the user input with each row of the DataFrame and get a ranked list of the rows based on cosine similarity.

Department  Country Age Grade   Score
Math    India   Young   A   97
Math    India   Young   B   86
Math    India   Young   D   68
Science India   Young   A   92
Science India   Young   B   81
Science India   Young   C   76
Social  India   Young   B   88
Social  India   Young   D   62
Social  India   Young   C   72

User input:

Country Age Grade   Score
India   Young   B   84
India   Young   D   65
India   Young   A   98

I would prefer to treat every row of the DataFrame as a list and the user input as a list, say User_list1 = ['India','Young','B','84'], then compare it with each row of the DataFrame using cosine similarity and get a ranked output of Department.

In my case, the output would be the ranked list of departments, Out = ['Math','Science','Social'], ordered by the cosine similarity results.

Answer 1:

Taking both DataFrames as given above,

df
  Department  Country    Age  Grade  Score
0       Math    India  Young      A     97
1       Math    India  Young      B     86
2       Math    India  Young      D     68
3    Science    India  Young      A     92
4    Science    India  Young      B     81
5    Science    India  Young      C     76
6     Social    India  Young      B     88
7     Social    India  Young      D     62
8     Social    India  Young      C     72

input

  Country    Age  Grade  Score
0   India  Young      B     84
1   India  Young      D     65
2   India  Young      A     98

One possible solution is:

import numpy as np
from collections import OrderedDict
from sklearn import preprocessing

# One encoder instance, refit on each categorical column below
le = preprocessing.LabelEncoder()

Convert the categorical features to numeric codes using scikit-learn's LabelEncoder:

df['Country'] = le.fit_transform(df['Country'])
df['Age'] = le.fit_transform(df['Age'])
df['Grade'] = le.fit_transform(df['Grade'])
df

Output:

  Department  Country  Age  Grade  Score
0       Math        0    0      0     97
1       Math        0    0      1     86
2       Math        0    0      3     68
3    Science        0    0      0     92
4    Science        0    0      1     81
5    Science        0    0      2     76
6     Social        0    0      1     88
7     Social        0    0      3     62
8     Social        0    0      2     72

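# Caution: fit_transform refits the encoder on the input's own labels
# (A=0, B=1, D=2), so grade D is coded 2 here but 3 in df. To keep the
# codes consistent, fit one encoder per column on df and call transform() here.
input['Country'] = le.fit_transform(input['Country'])
input['Age'] = le.fit_transform(input['Age'])
input['Grade'] = le.fit_transform(input['Grade'])
input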

Output:

   Country  Age  Grade  Score
0        0    0      1     84
1        0    0      2     65
2        0    0      0     98
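LabelEncoder numbers the sorted unique labels of whatever column it is fitted on, which is why the input's grade D is coded 2 in the output above while df codes it as 3. A minimal, library-free sketch of building the mapping once from the df values and reusing it for the input is shown here. `build_codes` and `encode` are hypothetical helper names, not part of scikit-learn; with scikit-learn itself, the usual way to get the same effect is to call fit on the df column and transform on the input column.

```python
# Hypothetical helpers that mimic LabelEncoder's sorted-label numbering.
def build_codes(labels):
    """Map each distinct label to its position in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

def encode(labels, codes):
    """Encode labels with an existing mapping; unseen labels raise KeyError."""
    return [codes[label] for label in labels]

df_grades = ['A', 'B', 'D', 'A', 'B', 'C', 'B', 'D', 'C']  # Grade column of df
input_grades = ['B', 'D', 'A']                              # Grade column of the input

grade_codes = build_codes(df_grades)               # built from df only
encoded_input = encode(input_grades, grade_codes)  # reused for the input
print(grade_codes)
print(encoded_input)
```

With this shared mapping the input grades become 1, 3, 0, so D carries the same code in both tables.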

Define a cosine-similarity function:

def cosine_similarity(a, b):
    nom = np.sum(np.multiply(a, b))
    denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
    sim = nom / denom
    return sim
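As a quick check of the function, the sketch below (which redefines it so the block runs on its own) evaluates it on a few vectors. The first two pairs are illustrative and not taken from the data; the last pair uses the encoded values of the first input row and the second Math row.

```python
import numpy as np

def cosine_similarity(a, b):
    nom = np.sum(np.multiply(a, b))
    denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
    return nom / denom

# Vectors pointing the same way score 1 regardless of their length
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
# Vectors at a right angle score 0
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
# Encoded input row 0 against encoded df row 1 (Math, grade B, score 86)
near = cosine_similarity(np.array([0, 0, 1, 84]), np.array([0, 0, 1, 86]))
print(same_direction, orthogonal, near)
```

Because Country and Age encode to 0 in every row and the Score values are far larger than the grade codes, the similarities in this dataset all sit close to 1 and differ only in the later decimal places.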

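# Unique department names in order of first appearance
dept = list(df['Department'].values)
dept = list(OrderedDict.fromkeys(dept).keys())
results = []
for i in range(len(input)):
    # Similarity between input row i and every row of df (Department column skipped)
    similarity = []
    for j in range(len(df)):
        a = input.iloc[i]
        b = df.iloc[j, 1:]
        c_sim = cosine_similarity(a, b)
        similarity.append(c_sim)

    # Best similarity per department; assumes df is grouped by department
    # with exactly 3 rows each
    max_similarity = []
    for k in range(0, len(df), 3):
        max_3 = max(similarity[k:k+3])
        max_similarity.append(max_3)

    # Keep the department with the highest similarity for this input row
    max_idx = max_similarity.index(max(max_similarity))
    results.append(dept[max_idx])
results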

Output (the best-matching department for each of the three input rows):

['Math', 'Social', 'Math']
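The loop above keeps only the single best department per input row, whereas the question asks for a ranked list. A possible sketch of that ranking follows, assuming each department is scored by the highest similarity among its rows (the same rule the loop uses). The arrays hard-code the encoded values shown in the outputs above, and `rank_departments` is a hypothetical helper name.

```python
import numpy as np

# Encoded values copied from the tables above (Country, Age, Grade, Score)
departments = ['Math'] * 3 + ['Science'] * 3 + ['Social'] * 3
df_values = np.array([
    [0, 0, 0, 97], [0, 0, 1, 86], [0, 0, 3, 68],
    [0, 0, 0, 92], [0, 0, 1, 81], [0, 0, 2, 76],
    [0, 0, 1, 88], [0, 0, 3, 62], [0, 0, 2, 72],
], dtype=float)
input_values = np.array([
    [0, 0, 1, 84], [0, 0, 2, 65], [0, 0, 0, 98],
], dtype=float)

def rank_departments(query, rows, labels):
    """Return department names ordered by their best cosine similarity to query."""
    # Cosine similarity of the query against every row at once
    sims = rows @ query / (np.linalg.norm(rows, axis=1) * np.linalg.norm(query))
    best = {}
    for label, sim in zip(labels, sims):
        best[label] = max(sim, best.get(label, -1.0))
    # Highest similarity first; sorted() is stable, so ties keep first-seen order
    return sorted(best, key=best.get, reverse=True)

rankings = [rank_departments(q, df_values, departments) for q in input_values]
print(rankings)
```

For the first input row this gives Math, Science, Social, which matches the order expected in the question. Unlike the loop above, it does not depend on each department having exactly three rows.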

Source: https://stackoverflow.com/questions/54033724/compare-a-list-with-the-rows-in-pandas-using-cosine-similarity-and-get-the-rank

Tags: python, python-3.x, cosine-similarity