Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas


I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The


2 Answers

春和景丽

2021-02-06 18:16
With the kind help of the solution posted by MaxU, here is the full code that completed the task I was trying to achieve. In addition to avoiding the MemoryError, it also avoids the odd NaNs that appeared in the cosine-similarity calculations when I tried some "hacky" workarounds.

Note that the code below is a partial snippet: the large dataframe df_all_export (186,134 rows x 5 columns) has already been constructed earlier in the full code.

I hope this helps others who are trying to calculate cosine similarity using tf-idf vectors between search queries and matched documents. For such a common "problem" I struggled to find a clear solution implemented with scikit-learn and Pandas.
```
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances as pcd

clf = TfidfVectorizer()

clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])

A = clf.transform(df_all_export['search_term'])
B = clf.transform(df_all_export['product_title'])

cosine = 1 - pcd(A, B)

df_all_export['tfidf_cosine'] = cosine
```
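
For readers who want to try the snippet without the full dataset, here is a minimal self-contained sketch of the same steps. The dataframe `df_toy` and its three rows are made up purely for illustration and stand in for `df_all_export`; only the column names are taken from the code above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Hypothetical stand-in for df_all_export (invented rows, same column names)
df_toy = pd.DataFrame({
    "search_term": ["angle bracket", "deck over", "shower only faucet"],
    "product_title": ["steel angle bracket", "wood sealer", "shower only faucet"],
})

# Fit one shared vocabulary on both columns so A and B have the same width
vectorizer = TfidfVectorizer()
vectorizer.fit(df_toy["search_term"] + " " + df_toy["product_title"])

A_toy = vectorizer.transform(df_toy["search_term"])    # sparse, one row per query
B_toy = vectorizer.transform(df_toy["product_title"])  # sparse, one row per title

# Row-by-row similarity: one value per dataframe row, never an n x n matrix
df_toy["tfidf_cosine"] = 1 - paired_cosine_distances(A_toy, B_toy)
print(df_toy)
```

In this toy data the third row has identical strings, so its similarity comes out as 1 (up to floating-point rounding), while the second row shares no words and comes out as 0.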

慢半拍i

2021-02-06 18:32

You can still work with sparse matrices / arrays using the sklearn.metrics.pairwise methods:

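```
# I've executed your example up to (and including) the fit step:
# ...
clf.fit(df['a'] + " " + df['b'])

A = clf.transform(df['a'])
B = clf.transform(df['b'])

# Import only the two functions used below instead of a wildcard import
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances
```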

paired_cosine_distances shows how far apart (how different) your strings are, comparing the values in the two columns row by row. A distance of 0 means a full match:

```
In [136]: paired_cosine_distances(A, B)
Out[136]: array([ 1.        ,  1.        ,  0.27437247,  0.        ])
```

cosine_similarity compares the first string of column a with every string in column b (row 1), the second string of column a with every string in column b (row 2), and so on:

```
In [137]: cosine_similarity(A, B)
Out[137]:
array([[ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.74162106,  0.        ],
       [ 0.43929881,  0.        ,  0.72562753,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])
```
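
The two outputs are related: in the session above, the diagonal of the cosine_similarity matrix (0, 0, 0.72562753, 1) is exactly 1 minus the paired distances (1, 1, 0.27437247, 0). The sketch below checks that relationship on a small made-up dataframe; `df_demo` and its contents are hypothetical and are not the data used in the session above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances

# Hypothetical data; every string contains at least one vocabulary word
df_demo = pd.DataFrame({
    "a": ["red apple", "green pear", "ripe banana bunch", "blue plum"],
    "b": ["green pear", "red apple pie", "banana bunch", "blue plum"],
})

vec = TfidfVectorizer()
vec.fit(df_demo["a"] + " " + df_demo["b"])
A_demo = vec.transform(df_demo["a"])
B_demo = vec.transform(df_demo["b"])

row_by_row = 1 - paired_cosine_distances(A_demo, B_demo)  # shape (4,)
all_pairs = cosine_similarity(A_demo, B_demo)             # shape (4, 4), dense

# The row-by-row similarities are the diagonal of the all-pairs matrix
print(np.allclose(row_by_row, np.diag(all_pairs)))
```

If only the row-by-row values are needed, paired_cosine_distances gives them without ever building the all-pairs matrix.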

```
In [141]: A
Out[141]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>

In [142]: B
Out[142]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>
```

NOTE: all calculations were done on sparse matrices; we never uncompressed them in memory!
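
To put rough numbers on why this matters, here is a back-of-the-envelope sketch. It assumes the 186,134-row dataframe mentioned in the other answer and 8-byte float64 values; the helper names `dense_pairwise_bytes` and `paired_bytes` are invented for this illustration, and the cause of the original MemoryError is an assumption, since the question text is cut off.

```python
FLOAT64_BYTES = 8  # size of one numpy.float64 value


def dense_pairwise_bytes(n_rows: int) -> int:
    """Bytes needed for a dense n_rows x n_rows float64 similarity matrix."""
    return n_rows * n_rows * FLOAT64_BYTES


def paired_bytes(n_rows: int) -> int:
    """Bytes needed for one float64 similarity value per row."""
    return n_rows * FLOAT64_BYTES


n = 186_134  # row count quoted for df_all_export
print(f"all pairs:  {dense_pairwise_bytes(n) / 1e9:.1f} GB")
print(f"row by row: {paired_bytes(n) / 1e6:.1f} MB")
```

An all-pairs matrix for that many rows would need roughly 277 GB, while the row-by-row result needs about 1.5 MB.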
