Question
I have a CSV file with the content shown below, and I want to calculate the cosine similarity between each ID and the remaining IDs in the file.
I have loaded it into a pandas DataFrame and computed the similarities as follows:
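import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# old_df is the DataFrame loaded from the CSV; parse each vector string into a 1-D array
old_df['Vector'] = old_df.apply(
    lambda row: np.array(np.matrix(row.Vector)).ravel(), axis=1)
l = []
for a in old_df['Vector']:
    l.append(a)
A = np.array(l)
similarities = cosine_similarity(A)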
The output looks fine. However, I do not know how to find which GUID (or ID) is similar to which other GUID (or ID), and I only want the top k pairs with the highest similarity scores.
Could you please help me solve this issue?
Thank you.
|Index | GUID | Vector |
|-------|-------|---------------------------------------|
|36099 | b770 |[-0.04870541 -0.02133574 0.03180726] |
|36098 | 808f |[ 0.0732905 -0.05331331 0.06378368] |
|36097 | b111 |[ 0.01994788 0.00417582 -0.09615131] |
|36096 | b6b5 |[0.025697 -0.08277534 -0.0124591] |
|36083 | 9b07 |[ 0.025697 -0.08277534 -0.0124591] |
|36082 | b9ed |[-0.00952298 0.06188576 -0.02636449] |
|36081 | a5b6 |[0.00432161 0.02264584 -0.0341924] |
|36080 | 9891 |[ 0.08732156 0.00649456 -0.02014138] |
|36079 | ba40 |[0.05407356 -0.09085554 -0.07671648] |
|36078 | 9dff |[-0.09859556 0.04498474 -0.01839088] |
|36077 | a423 |[-0.06124249 0.06774347 -0.05234318] |
|36076 | 81c4 |[0.07278682 -0.10460124 -0.06572364] |
|36075 | 9f88 |[0.09830415 0.05489364 -0.03916228] |
|36074 | adb8 |[0.03149953 -0.00486591 0.01380711] |
|36073 | 9765 |[0.00673934 0.0513557 -0.09584251] |
|36072 | aff4 |[-0.00097896 0.0022945 0.01643319] |
Answer 1:
Example code to get the top k cosine similarities and their corresponding GUIDs and row indexes:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = {"GUID": ["b770", "808f", "b111"], "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
df = pd.DataFrame(data)
print("Data: \n{}\n".format(df))

vectors = []
for v in df['Vector']:
    vectors.append(v)
vectors_num = len(vectors)
A = np.array(vectors)

# Get similarities matrix
similarities = cosine_similarity(A)
# Mask the diagonal and lower triangle (-2 is below any cosine value)
# so that self-matches and duplicate pairs are never selected
similarities[np.tril_indices(vectors_num)] = -2
print("Similarities: \n{}\n".format(similarities))

k = 2
# Cap k at the number of distinct pairs
pairs_num = vectors_num * (vectors_num - 1) // 2
if k > pairs_num:
    k = pairs_num

# Get top k similarities and pair GUID in ascending order
top_k_indexes = np.unravel_index(np.argsort(similarities.ravel())[-k:], similarities.shape)
top_k_similarities = similarities[top_k_indexes]
top_k_pair_GUID = []
# top_k_indexes is (row_indexes, column_indexes); zip them into (row, column) pairs
for row, col in zip(*top_k_indexes):
    pair_GUID = (df.iloc[row]["GUID"], df.iloc[col]["GUID"])
    top_k_pair_GUID.append(pair_GUID)
print("top_k_indexes: \n{}\ntop_k_pair_GUID: \n{}\ntop_k_similarities: \n{}".format(top_k_indexes, top_k_pair_GUID, top_k_similarities))
Output:
Data:
GUID Vector
0 b770 [-0.1, -0.2, 0.3]
1 808f [0.1, -0.2, -0.3]
2 b111 [-0.1, 0.2, -0.3]
Similarities:
[[-2. -0.42857143 -0.85714286]
[-2. -2. 0.28571429]
[-2. -2. -2. ]]
top_k_indexes:
(array([0, 1], dtype=int64), array([1, 2], dtype=int64))
top_k_pair_GUID:
[('b770', '808f'), ('808f', 'b111')]
top_k_similarities:
[-0.42857143 0.28571429]
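The code above ranks pairs across the whole matrix. If the goal is instead to list, for every GUID, its own k most similar GUIDs, a per-row variant might look like the following. This is only a sketch: the helper name top_k_neighbours and the output column names neighbour and similarity are made up for illustration, and it assumes the Vector column already holds numeric lists or arrays of equal length.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def top_k_neighbours(df, k):
    """Hypothetical helper: for each GUID, return its k most similar other GUIDs."""
    A = np.array(df["Vector"].tolist(), dtype=float)
    sims = cosine_similarity(A)
    # Exclude self-matches by pushing the diagonal below any valid cosine value.
    np.fill_diagonal(sims, -2.0)
    # A row can have at most len(df) - 1 neighbours.
    k = min(k, len(df) - 1)
    # Column indexes of the k largest similarities per row, best match first.
    order = np.argsort(-sims, axis=1)[:, :k]
    guids = df["GUID"].to_numpy()
    rows = []
    for i, neighbours in enumerate(order):
        for j in neighbours:
            rows.append({"GUID": guids[i],
                         "neighbour": guids[j],
                         "similarity": sims[i, j]})
    return pd.DataFrame(rows)


df = pd.DataFrame({
    "GUID": ["b770", "808f", "b111"],
    "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]],
})
result = top_k_neighbours(df, k=1)
print(result)
```

With the three sample vectors and k = 1, b770 is paired with 808f, 808f with b111, and b111 with 808f. Sorting every row in full is simple but costs O(n^2 log n); for a large file, np.argpartition could be used to pick the top k per row without a full sort.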
Source: https://stackoverflow.com/questions/65402052/cosine-similarity-rows-in-a-dataframe-of-pandas