Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

Question

I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search engine/ranking machine learning algorithm.

I'm doing this in an IPython notebook and am unfortunately running into MemoryErrors. After a few hours of digging I'm still not sure why.

My setup:

Lenovo E560 laptop
Core i7-6500U @ 2.50 GHz
16 GB RAM
Windows 10
Using the anaconda 3.5 kernel with a fresh update of all libraries

I've tested my code/goal on a small toy dataset as per a similar stackoverflow question thusly:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial

clf = TfidfVectorizer()

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']

df = pd.DataFrame(data={'a':a, 'b':b})

clf.fit(df['a'] + " " + df['b'])

tfidf_a = clf.transform(df['a']).todense()
tfidf_b = clf.transform(df['b']).todense()

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]

df['tfidf_cosine_similarity'] = row_similarities

print(df)

This gives the following (good!) output:

                   a                 b  tfidf_cosine_similarity
0         hello world        my name is                 0.000000
1          my name is       hello world                 0.000000
2  what is your name?  my name is what?                 0.725628
3      max cosine sim    max cosine sim                 1.000000

However, when I try to apply the same method to a dataframe (df_all_export) with dimensions 186,154 x 5, where 2 of the 5 columns are the query (search_term) and the document (product_title), as follows:

clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])

tfidf_a = clf.transform(df_all_export['search_term']).todense()
tfidf_b = clf.transform(df_all_export['product_title']).todense()

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]
df_all_export['tfidf_cosine_similarity'] = row_similarities

df_all_export.head()

I get the following (I haven't included the whole error, but you get the idea):

MemoryError                               Traceback (most recent call last)
<ipython-input-27-8308fcfa8f9f> in <module>()
     12 clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])
     13 
---> 14 tfidf_a = clf.transform(df_all_export['search_term']).todense()
     15 tfidf_b = clf.transform(df_all_export['product_title']).todense()
     16

Absolutely lost on this one, but I fear the solution will be quite simple and elegant :)

Thank you in advance!

Answer 1:

You can still work with sparse matrices / arrays by using the sklearn.metrics.pairwise methods:

# I've executed your example up to (including):
# ...
clf.fit(df['a'] + " " + df['b'])

A = clf.transform(df['a'])

B = clf.transform(df['b'])

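# import only the two functions used below instead of a wildcard import
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances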

paired_cosine_distances shows how far apart your strings are, comparing the values in the two columns row by row. A distance of 0 means a full match:

In [136]: paired_cosine_distances(A, B)
Out[136]: array([ 1.        ,  1.        ,  0.27437247,  0.        ])

cosine_similarity compares the first string of column a with every string in column b (row 1), the second string of column a with every string in column b (row 2), and so on:

In [137]: cosine_similarity(A, B)
Out[137]:
array([[ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.74162106,  0.        ],
       [ 0.43929881,  0.        ,  0.72562753,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

In [141]: A
Out[141]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>

In [142]: B
Out[142]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>

NOTE: all calculations were done using sparse matrices; we never uncompressed them in memory!
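
To see why the .todense() calls in the question run out of memory while the sparse route does not, it can help to estimate the sizes involved. The sketch below is only a back-of-the-envelope estimate: dense_bytes and csr_bytes are hypothetical helpers written for this illustration, and the vocabulary size of 20,000 terms is an assumed figure (the real value would be len(clf.vocabulary_) after fitting), not a number taken from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense float64 array of the given shape."""
    return n_rows * n_cols * itemsize


def csr_bytes(m):
    """Bytes held by a SciPy CSR matrix: stored values plus both index arrays."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Toy data from the question.
a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a': a, 'b': b})

clf = TfidfVectorizer()
clf.fit(df['a'] + " " + df['b'])
A = clf.transform(df['a'])  # stays sparse (CSR)

print("toy dense bytes:", dense_bytes(*A.shape))
print("toy CSR bytes:  ", csr_bytes(A))

n_rows = 186154        # row count stated in the question
assumed_vocab = 20000  # ASSUMPTION: illustrative vocabulary size only

# Size of ONE dense tf-idf matrix, as produced by .todense()
print("dense tf-idf, GB:", dense_bytes(n_rows, assumed_vocab) / 1e9)

# cosine_similarity(A, B) returns a dense n_rows x n_rows array by default
print("all-pairs matrix, GB:", dense_bytes(n_rows, n_rows) / 1e9)
```

Under that assumption a single dense tf-idf matrix already needs roughly 30 GB, well beyond 16 GB of RAM, and the all-pairs matrix from cosine_similarity would be larger still. That is why the row-by-row paired_cosine_distances is the better fit for the full dataframe.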

Answer 2:

With the kind help of the solution posted by MaxU above, here is the full code that completed the task I was trying to achieve. In addition to avoiding the MemoryError, it also dodges the weird NaNs that appeared in the cosine-similarity calculations when I tried some "hacky" workarounds.

Note that the code below is a partial snippet: the large dataframe df_all_export, with dimensions 186,134 x 5, has already been constructed in the full code.

I hope this helps others who are trying to calculate cosine similarity using tf-idf vectors, between search queries and matched documents. For such a common "problem" I struggled to find a clear solution implemented with SKLearn and Pandas.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances as pcd

clf = TfidfVectorizer()

clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])

A = clf.transform(df_all_export['search_term'])
B = clf.transform(df_all_export['product_title'])

cosine = 1 - pcd(A, B)

df_all_export['tfidf_cosine'] = cosine
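
As an aside, here is a minimal sketch of an alternative way to get the same row-by-row similarity, assuming the vectorizer is left at its default norm='l2' setting. Every non-empty tf-idf row then has unit length, so the cosine similarity of a pair of rows is simply their dot product, which can be computed without leaving sparse format. It is shown on the toy data from the question; the variable name sim is only illustrative, and this is not the approach either answer used.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Toy data from the question.
a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a': a, 'b': b})

clf = TfidfVectorizer()  # default norm='l2': each non-empty row has unit length
clf.fit(df['a'] + " " + df['b'])

A = clf.transform(df['a'])
B = clf.transform(df['b'])

# The element-wise product stays sparse; summing each row gives the
# row-wise dot product, which equals cosine similarity for unit-length rows.
sim = np.asarray(A.multiply(B).sum(axis=1)).ravel()

# Sanity check on the toy data against the approach used in the answers.
print(np.allclose(sim, 1 - paired_cosine_distances(A, B)))

df['tfidf_cosine'] = sim
print(df)
```

In this formulation a row containing no known terms becomes an all-zero vector, and its similarity to any other row is simply 0.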

Source: https://stackoverflow.com/questions/42965181/python-memoryerror-when-computing-tf-idf-cosine-similarity-between-two-columns

Tags

python

pandas

scikit-learn

tf-idf

cosine-similarity