Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

南方客 2021-02-06 17:54

I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The

2 answers
  •  慢半拍i (original poster)
    2021-02-06 18:32

    You can keep working with sparse matrices/arrays by using the sklearn.metrics.pairwise functions:

    # I've executed your example up to (including):
    # ...
    clf.fit(df['a'] + " " + df['b'])
    
    A = clf.transform(df['a'])
    
    B = clf.transform(df['b'])
    
    from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances
    

    paired_cosine_distances shows how different your strings are, comparing the values in the two columns row by row.

    A distance of 0 means a full match; a distance of 1 means the two strings share no terms.

    In [136]: paired_cosine_distances(A, B)
    Out[136]: array([ 1.        ,  1.        ,  0.27437247,  0.        ])
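
As a self-contained sketch of the row-by-row comparison, the snippet below rebuilds the steps on a small made-up DataFrame (the data and the `similarity` column name are assumptions, since the question's real data is not shown here) and converts the distances back to similarities:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Hypothetical toy data; the question's real DataFrame is not shown here.
df = pd.DataFrame({
    "a": ["red shoes", "blue jeans", "green hat"],
    "b": ["red shoes", "yellow socks", "green wool hat"],
})

# Fit one vocabulary over both columns so A and B share the same feature space.
clf = TfidfVectorizer()
clf.fit(df["a"] + " " + df["b"])
A = clf.transform(df["a"])  # SciPy sparse matrix
B = clf.transform(df["b"])  # SciPy sparse matrix

# One distance per row: row i of A is compared only with row i of B.
row_distance = paired_cosine_distances(A, B)

# Cosine similarity is 1 minus cosine distance.
df["similarity"] = 1.0 - row_distance
print(df)
```

The result has one value per row, so its size grows linearly with the number of rows rather than quadratically.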
    

    cosine_similarity compares the first string of column a with every string in column b (row 1 of the result), the second string of column a with every string in column b (row 2), and so on:

    In [137]: cosine_similarity(A, B)
    Out[137]:
    array([[ 0.        ,  1.        ,  0.        ,  0.        ],
           [ 1.        ,  0.        ,  0.74162106,  0.        ],
           [ 0.43929881,  0.        ,  0.72562753,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  1.        ]])
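
Note that the all-pairs result has one entry per (row of a, row of b) combination, so for n rows it holds n x n values. A minimal sketch, assuming made-up query and title lists, of asking cosine_similarity to return that result as a sparse matrix through its dense_output parameter:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy columns standing in for the query and title columns.
queries = ["red shoes", "blue jeans", "green hat"]
titles = ["red shoes", "yellow socks", "green wool hat"]

# Shared vocabulary so both matrices have the same columns.
clf = TfidfVectorizer().fit(queries + titles)
A = clf.transform(queries)
B = clf.transform(titles)

# With dense_output=False the result stays sparse when both inputs are sparse,
# so pairs with no shared terms do not take up space as stored zeros.
sim = cosine_similarity(A, B, dense_output=False)
print(issparse(sim), sim.shape)

# The diagonal holds the row-by-row similarities (query i vs. title i).
print(sim.diagonal())
```

If only the row-by-row values are needed, the paired function avoids building the n x n matrix at all.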
    
    In [141]: A
    Out[141]:
    <4x10 sparse matrix of type '<class 'numpy.float64'>'
            with 12 stored elements in Compressed Sparse Row format>
    
    In [142]: B
    Out[142]:
    <4x10 sparse matrix of type '<class 'numpy.float64'>'
            with 12 stored elements in Compressed Sparse Row format>
    

    NOTE: all calculations were done on sparse matrices; we never converted them to dense arrays in memory!
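
To make the memory point concrete, here is a rough sketch, on a made-up corpus, that compares the bytes held by a CSR matrix's three underlying arrays with the bytes a dense array of the same shape would need. The helper name `csr_nbytes` is hypothetical, and the numbers for a toy corpus are only illustrative; the gap widens as the vocabulary grows.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; real data would have far more rows and terms.
texts = ["red shoes", "blue jeans", "green wool hat", "yellow socks"]
A = TfidfVectorizer().fit_transform(texts)


def csr_nbytes(m):
    """Bytes used by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


sparse_bytes = csr_nbytes(A)
# Size of the equivalent dense array: rows * columns * bytes per value.
dense_bytes = A.shape[0] * A.shape[1] * A.dtype.itemsize

print(issparse(A), A.shape, A.nnz)
print("sparse bytes:", sparse_bytes, "dense bytes:", dense_bytes)
```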
