Question
I have a pandas DataFrame and a user input. I need to compare the user input with each row of the DataFrame and get a ranked list of rows based on cosine similarity.
Department Country Age Grade Score
Math India Young A 97
Math India Young B 86
Math India Young D 68
Science India Young A 92
Science India Young B 81
Science India Young C 76
Social India Young B 88
Social India Young D 62
Social India Young C 72
User input:
Country Age Grade Score
India Young B 84
India Young D 65
India Young A 98
I would prefer to treat each row of the DataFrame as a list, and the user input as a list as well, say
User_list1 = ['India','Young','B','84']
then compare it with each row of the DataFrame using cosine similarity and get a ranked output of Department.
In my case, the output would be the ranked list of departments:
Out = ['Math','Science','Social']
The ranking should be based on the cosine similarity results.
Answer 1:
Taking the two DataFrames as above,
df
Department Country Age Grade Score
0 Math India Young A 97
1 Math India Young B 86
2 Math India Young D 68
3 Science India Young A 92
4 Science India Young B 81
5 Science India Young C 76
6 Social India Young B 88
7 Social India Young D 62
8 Social India Young C 72
input
Country Age Grade Score
0 India Young B 84
1 India Young D 65
2 India Young A 98
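The two frames are only shown as printed tables, so here is a minimal sketch of how they could be built, with values copied from the tables above. The name `user_input` is a hypothetical stand-in: the answer calls this frame `input`, which shadows Python's built-in `input()`.

```python
import pandas as pd

# Rebuild the question's table; values are copied from the rows shown above.
df = pd.DataFrame({
    "Department": ["Math"] * 3 + ["Science"] * 3 + ["Social"] * 3,
    "Country": ["India"] * 9,
    "Age": ["Young"] * 9,
    "Grade": ["A", "B", "D", "A", "B", "C", "B", "D", "C"],
    "Score": [97, 86, 68, 92, 81, 76, 88, 62, 72],
})

# Hypothetical name; the answer refers to this frame as `input`.
user_input = pd.DataFrame({
    "Country": ["India"] * 3,
    "Age": ["Young"] * 3,
    "Grade": ["B", "D", "A"],
    "Score": [84, 65, 98],
})

print(df)
print(user_input)
```

To follow the rest of the answer verbatim, assign `input = user_input` afterwards, bearing in mind that this hides the built-in.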
One possible solution is as follows.
from collections import OrderedDict
import numpy as np
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Convert the categorical features to numeric codes using scikit-learn's LabelEncoder:
df['Country'] = le.fit_transform(df['Country'])
df['Age'] = le.fit_transform(df['Age'])
df['Grade'] = le.fit_transform(df['Grade'])
df
Output:
Department Country Age Grade Score
0 Math 0 0 0 97
1 Math 0 0 1 86
2 Math 0 0 3 68
3 Science 0 0 0 92
4 Science 0 0 1 81
5 Science 0 0 2 76
6 Social 0 0 1 88
7 Social 0 0 3 62
8 Social 0 0 2 72
Encode the categorical columns of input as well:
input['Country'] = le.fit_transform(input['Country'])
input['Age'] = le.fit_transform(input['Age'])
input['Grade'] = le.fit_transform(input['Grade'])
input
Output:
Country Age Grade Score
0 0 0 1 84
1 0 0 2 65
2 0 0 0 98
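One thing to watch: calling fit_transform again on input refits the encoder on the input values alone, so the codes need not line up with those in df. In the outputs above, Grade D is encoded as 3 in df but as 2 in input. A minimal sketch of one way to keep the codes aligned, assuming one LabelEncoder per column that is fitted on df and then reused with transform (the `encoders` dict and the `user_input` name are hypothetical, and only the Grade column is shown):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Grade": ["A", "B", "D", "A", "B", "C", "B", "D", "C"]})
user_input = pd.DataFrame({"Grade": ["B", "D", "A"]})

encoders = {}  # hypothetical helper: column name -> fitted encoder
for col in ["Grade"]:
    enc = LabelEncoder()
    # Learn the label-to-code mapping from the reference frame only.
    df[col] = enc.fit_transform(df[col])
    # Reuse that mapping for the user input instead of refitting.
    user_input[col] = enc.transform(user_input[col])
    encoders[col] = enc

print(user_input["Grade"].tolist())  # [1, 3, 0]: D now matches df's code 3
```

transform raises a ValueError for labels that were not seen during fit, so this assumes every input value also occurs in df. With aligned codes, the similarity values (and possibly the final results) can differ from those shown in this answer.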
Define a cosine-similarity function:
def cosine_similarity(a, b):
    nom = np.sum(np.multiply(a, b))
    denom = np.sqrt(np.sum(np.square(a))) * np.sqrt(np.sum(np.square(b)))
    sim = nom / denom
    return sim
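The function above computes the dot product of the two vectors divided by the product of their Euclidean norms. As a quick sanity check, here is an equivalent formulation using np.dot and np.linalg.norm on made-up vectors (`cosine_sim` is a hypothetical name, chosen so it does not clash with the function above):

```python
import numpy as np

def cosine_sim(a, b):
    # Cast to float arrays so lists and integer inputs behave the same way.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

orthogonal = cosine_sim([1, 0], [0, 1])  # perpendicular vectors
parallel = cosine_sim([1, 2], [2, 4])    # same direction, different length

print(orthogonal, parallel)
```

Perpendicular vectors give 0 and vectors pointing in the same direction give a value of (approximately) 1 regardless of their lengths. The measure is undefined when either vector is all zeros.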
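Collect the department names in order of first appearance, then, for each input row, compute its similarity to every row of df and keep the department whose best-matching row scores highest:
# Unique department names, preserving the order in which they appear
dept = list(OrderedDict.fromkeys(df['Department']))
results = []
for i in range(len(input)):
    # Similarity of input row i to every row of df
    similarity = []
    for j in range(len(df)):
        a = input.iloc[i]
        b = df.iloc[j, 1:]  # skip the Department column
        c_sim = cosine_similarity(a, b)
        similarity.append(c_sim)
    # Best similarity per department; this assumes each department
    # occupies exactly 3 consecutive rows of df
    max_similarity = []
    for k in range(0, len(df), 3):
        max_3 = max(similarity[k:k+3])
        max_similarity.append(max_3)
    # Department with the highest best-row similarity
    max_idx = max_similarity.index(max(max_similarity))
    results.append(dept[max_idx])
results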
Output:
['Math', 'Social', 'Math']
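The loop above returns only the top department for each input row and relies on every department having exactly three consecutive rows. The question asks for a ranked list, so here is a sketch of one way to rank all departments per input row without that assumption, using a matrix product for the similarities and a groupby on Department. The encoded values are copied as-is from the outputs earlier in this answer; `user_input`, `features`, and `rankings` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Already-encoded frames, copied from the outputs shown earlier in the answer.
df = pd.DataFrame({
    "Department": ["Math"] * 3 + ["Science"] * 3 + ["Social"] * 3,
    "Country": [0] * 9,
    "Age": [0] * 9,
    "Grade": [0, 1, 3, 0, 1, 2, 1, 3, 2],
    "Score": [97, 86, 68, 92, 81, 76, 88, 62, 72],
})
user_input = pd.DataFrame({
    "Country": [0] * 3, "Age": [0] * 3,
    "Grade": [1, 2, 0], "Score": [84, 65, 98],
})

features = ["Country", "Age", "Grade", "Score"]
X = df[features].to_numpy(dtype=float)          # shape (9, 4)
Q = user_input[features].to_numpy(dtype=float)  # shape (3, 4)

# Normalise each row to unit length; one matrix product then yields
# every pairwise cosine similarity.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
sims = Qn @ Xn.T                                # shape (3, 9)

rankings = []
for row in sims:
    # Best similarity per department, then departments sorted best-first.
    best = pd.Series(row, index=df["Department"]).groupby(level=0).max()
    rankings.append(best.sort_values(ascending=False).index.tolist())

print(rankings[0])  # ['Math', 'Science', 'Social']
```

For the first input row this reproduces the order requested in the question. Because Score is much larger than the other features, all similarities here are very close to 1 and ties can occur (the third input row matches both a Math and a Science row exactly), so the order among tied departments is not guaranteed.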
Source: https://stackoverflow.com/questions/54033724/compare-a-list-with-the-rows-in-pandas-using-cosine-similarity-and-get-the-rank