I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The
You can still work with sparse matrices / arrays using the sklearn.metrics.pairwise methods:
# I've executed your example up to (including):
# ...
clf.fit(df['a'] + " " + df['b'])
A = clf.transform(df['a'])
B = clf.transform(df['b'])
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances
paired_cosine_distances shows how far apart (how different) your strings are, comparing the values in the two columns row by row. A distance of 0 means a full match:
In [136]: paired_cosine_distances(A, B)
Out[136]: array([ 1. , 1. , 0.27437247, 0. ])
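If what you want is a similarity column rather than a distance, one option is to subtract the paired distance from 1. The following is a minimal, self-contained sketch; the dataframe contents and the "similarity" column name are made up for illustration and are not taken from the question:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "a": ["red running shoes", "usb cable", "coffee mug"],
    "b": ["red running shoes", "wireless mouse", "ceramic coffee mug"],
})

clf = TfidfVectorizer()
# Fit on both columns so they share one vocabulary.
clf.fit(df["a"] + " " + df["b"])
A = clf.transform(df["a"])  # sparse matrix, one row per query
B = clf.transform(df["b"])  # sparse matrix, one row per title

# Row-by-row cosine distance; similarity is 1 - distance.
df["similarity"] = 1 - paired_cosine_distances(A, B)
print(df)
```

Identical strings end up with a similarity of (approximately) 1, and strings with no shared tokens end up at (approximately) 0.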
cosine_similarity compares the first string of column a with all strings in column b (row 1), the second string of column a with all strings in column b (row 2), and so on:
In [137]: cosine_similarity(A, B)
Out[137]:
array([[ 0. , 1. , 0. , 0. ],
[ 1. , 0. , 0.74162106, 0. ],
[ 0.43929881, 0. , 0.72562753, 0. ],
[ 0. , 0. , 0. , 1. ]])
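The two functions are related: the diagonal of the cosine_similarity matrix holds the row-by-row similarities, which should equal 1 minus the paired distances (compare the outputs above). A small sketch to check this, using hypothetical strings rather than the question's data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances

# Hypothetical strings, for illustration only.
a = ["red running shoes", "usb cable", "coffee mug"]
b = ["blue running shoes", "usb charging cable", "wireless mouse"]

clf = TfidfVectorizer()
clf.fit([x + " " + y for x, y in zip(a, b)])  # shared vocabulary
A = clf.transform(a)
B = clf.transform(b)

full = cosine_similarity(A, B)           # all pairs, shape (3, 3)
paired = paired_cosine_distances(A, B)   # matching rows only, shape (3,)

# The diagonal of the full matrix is the row-by-row similarity.
diag_matches = np.allclose(np.diag(full), 1 - paired)
print(diag_matches)
```

If you only need the row-by-row values, the paired function avoids computing the n x n matrix at all.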
In [141]: A
Out[141]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
In [142]: B
Out[142]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
NOTE: all calculations were done on sparse matrices; we never converted them to dense arrays in memory!
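One caveat worth checking on large data: the inputs stay sparse, but cosine_similarity returns a dense n x n array by default. Passing dense_output=False keeps the result sparse when both inputs are sparse. A rough sketch, again with hypothetical strings:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical strings, for illustration only.
a = ["red running shoes", "usb cable", "coffee mug"]
b = ["blue running shoes", "usb charging cable", "wireless mouse"]

clf = TfidfVectorizer()
clf.fit([x + " " + y for x, y in zip(a, b)])
A = clf.transform(a)
B = clf.transform(b)

# The transformed inputs are sparse matrices.
inputs_sparse = issparse(A) and issparse(B)

# Keep the all-pairs result sparse as well.
S = cosine_similarity(A, B, dense_output=False)
print(inputs_sparse, issparse(S), S.shape)
```

Whether a sparse result actually saves memory depends on how many pairs share at least one token.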