I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search engine/ranking machine learning algorithm.
I'm doing this in an IPython notebook and keep running into MemoryErrors. After a few hours of digging I still can't see why.
My setup:
- Lenovo E560 laptop
- Core i7-6500U @ 2.50 GHz
- 16 GB RAM
- Windows 10
- Using the anaconda 3.5 kernel with a fresh update of all libraries
I've tested my approach on a small toy dataset, following a similar Stack Overflow question:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial
clf = TfidfVectorizer()
a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a':a, 'b':b})
clf.fit(df['a'] + " " + df['b'])
tfidf_a = clf.transform(df['a']).todense()
tfidf_b = clf.transform(df['b']).todense()
row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]
df['tfidf_cosine_similarity'] = row_similarities
print(df)
This gives the following (good!) output:
a b tfidf_cosine_similarity
0 hello world my name is 0.000000
1 my name is hello world 0.000000
2 what is your name? my name is what? 0.725628
3 max cosine sim max cosine sim 1.000000
However, when I try to apply the same method to a dataframe (df_all_export) with dimensions 186,154 x 5, where 2 of the 5 columns are the query (search_term) and the document (product_title), like this:
clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])
tfidf_a = clf.transform(df_all_export['search_term']).todense()
tfidf_b = clf.transform(df_all_export['product_title']).todense()
row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]
df_all_export['tfidf_cosine_similarity'] = row_similarities
df_all_export.head()
I get the following (I haven't included the whole error here, but you get the idea):
MemoryError Traceback (most recent call last)
<ipython-input-27-8308fcfa8f9f> in <module>()
12 clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])
13
---> 14 tfidf_a = clf.transform(df_all_export['search_term']).todense()
15 tfidf_b = clf.transform(df_all_export['product_title']).todense()
16
Absolutely lost on this one, but I fear the solution will be quite simple and elegant :)
Thank you in advance!
You can keep working with sparse matrices / arrays by using the sklearn.metrics.pairwise methods:
# I've executed your example up to (including):
# ...
clf.fit(df['a'] + " " + df['b'])
A = clf.transform(df['a'])
B = clf.transform(df['b'])
from sklearn.metrics.pairwise import paired_cosine_distances, cosine_similarity
paired_cosine_distances shows how far apart (how different) your strings are, comparing the values in the two columns row by row. A distance of 0 means a full match:
In [136]: paired_cosine_distances(A, B)
Out[136]: array([ 1. , 1. , 0.27437247, 0. ])
cosine_similarity compares the first string of column a with all strings in column b (row 1), the second string of column a with all strings in column b (row 2), and so on:
In [137]: cosine_similarity(A, B)
Out[137]:
array([[ 0. , 1. , 0. , 0. ],
[ 1. , 0. , 0.74162106, 0. ],
[ 0.43929881, 0. , 0.72562753, 0. ],
[ 0. , 0. , 0. , 1. ]])
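To see how the two functions relate, here is a small sketch on the same toy data (the names `paired` and `full` are mine, not from the question). The diagonal of the `cosine_similarity` matrix should match `1 - paired_cosine_distances`. Keep in mind that `cosine_similarity` returns a dense n x n array by default, so for a frame with roughly 186k rows the paired version is the one that stays small.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances, cosine_similarity

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a': a, 'b': b})

clf = TfidfVectorizer()
clf.fit(df['a'] + " " + df['b'])
A = clf.transform(df['a'])  # stays sparse
B = clf.transform(df['b'])  # stays sparse

# Row-by-row similarity: one value per dataframe row.
paired = 1 - paired_cosine_distances(A, B)

# All-pairs similarity: an n x n matrix (dense by default).
full = cosine_similarity(A, B)

# The row-by-row values are the diagonal of the all-pairs matrix.
print(paired)
print(np.diag(full))
```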
In [141]: A
Out[141]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
In [142]: B
Out[142]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
with 12 stored elements in Compressed Sparse Row format>
NOTE: all calculations have been done using sparse matrices - we didn't uncompress them in memory!
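As a rough illustration of why `.todense()` is the step that fails, the sketch below compares the bytes a dense float64 copy would need with what the CSR matrix actually stores, using the toy data. The 10,000-term vocabulary in the last lines is purely an assumption for scale; the real vocabulary size of df_all_export is not given in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a': a, 'b': b})

clf = TfidfVectorizer()
clf.fit(df['a'] + " " + df['b'])
A = clf.transform(df['a'])  # scipy CSR sparse matrix

n_rows, n_cols = A.shape
# A dense copy needs one float64 cell for every (row, column) pair.
dense_bytes = n_rows * n_cols * A.dtype.itemsize
# CSR keeps only the non-zero values plus two index arrays.
sparse_bytes = A.data.nbytes + A.indices.nbytes + A.indptr.nbytes

print("shape:", A.shape, "non-zeros:", A.nnz)
print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)

# Hypothetical scale-up: 186,154 rows with an ASSUMED vocabulary of
# 10,000 terms (not a figure from the question).
assumed_vocab = 10_000
dense_gb = 186_154 * assumed_vocab * 8 / 1e9
print("one dense matrix at that size: about %.1f GB" % dense_gb)
```

On the toy data the saving is small because the matrix is tiny. The dense cost grows with rows times vocabulary, while the sparse cost grows only with the number of non-zero entries.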
With the kind help and solution posted by MaxU above, here is the full code that completed the task I was trying to achieve. In addition to avoiding the MemoryError, it also dodges the weird NaNs that appeared in the cosine-similarity calculations when I tried some "hacky" workarounds.
Note that the code below is a partial snippet: the large dataframe df_all_export, with dimensions 186,134 x 5, has already been constructed in the full code.
I hope this helps others who are trying to calculate cosine similarity using tf-idf vectors, between search queries and matched documents. For such a common "problem" I struggled to find a clear solution implemented with SKLearn and Pandas.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances as pcd
clf = TfidfVectorizer()
clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])
A = clf.transform(df_all_export['search_term'])
B = clf.transform(df_all_export['product_title'])
cosine = 1 - pcd(A, B)
df_all_export['tfidf_cosine'] = cosine
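A possible alternative, sketched here on the toy data because df_all_export is not available: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the cosine similarity of two rows is just their dot product. That can be computed without leaving sparse format. The names `cosine_dot` and `cosine_pcd` are mine. Under this formulation a row with no known tokens is all zeros and gets a similarity of 0.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances as pcd

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a': a, 'b': b})

clf = TfidfVectorizer()  # default norm='l2' gives unit-length rows
clf.fit(df['a'] + " " + df['b'])
A = clf.transform(df['a'])
B = clf.transform(df['b'])

# Element-wise product stays sparse; summing each row gives the dot product.
cosine_dot = np.asarray(A.multiply(B).sum(axis=1)).ravel()

# Same quantity via the approach used in the answer above.
cosine_pcd = 1 - pcd(A, B)

df['tfidf_cosine'] = cosine_dot
print(df)
```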
Source: https://stackoverflow.com/questions/42965181/python-memoryerror-when-computing-tf-idf-cosine-similarity-between-two-columns