Question
I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits perfectly in memory, but I get the MemoryError during the internal np.dot() call.
Here's my use-case and how I am currently tackling it.
Here's my parent vector of 100 dimensions, which I need to compare with 500,000 other vectors of the same dimension (i.e. 100):
parent_vector = [1, 2, 3, 4 ..., 100]
Here are my child vectors (with some made-up random numbers for this example)
child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]
My final goal is to get the top-N child vectors (with their names, such as child_vector_1, and their corresponding cosine scores) that have the highest cosine similarity with the parent vector.
My current approach (which I know is inefficient and memory-consuming):
Step 1: Create a super-dataframe of the following shape:
parent_vector 1, 2, 3, ....., 100
child_vector_1 2, 3, 4, ....., 101
child_vector_2 3, 4, 5, ....., 102
child_vector_3 4, 5, 6, ....., 103
......................................
child_vector_500000 3, 4, 5, ....., 103
Step 2: Use
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
to get the pairwise cosine similarity between all vectors (shown in the above dataframe).
Step 3: Make a list of tuples to store the key (such as child_vector_1) and the value (the corresponding cosine similarity score) for all such combinations.
Step 4: Get the top-N using sort() on the list, so that I get the child vector name as well as its cosine similarity score with the parent vector.
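In code, my current approach looks roughly like this (a simplified sketch; the vectors dict is just a stand-in for however the data is actually loaded):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# vectors: dict mapping names ("parent_vector", "child_vector_1", ...) to 100-d lists
df = pd.DataFrame.from_dict(vectors, orient='index')    # Step 1: super-dataframe

sims = cosine_similarity(df)                             # Step 2: full pairwise matrix

parent_idx = df.index.get_loc('parent_vector')
pairs = [(name, sims[parent_idx, i])                     # Step 3: (name, score) tuples
         for i, name in enumerate(df.index) if name != 'parent_vector']

pairs.sort(key=lambda t: t[1], reverse=True)             # Step 4: sort and take top-N
top_n = pairs[:10]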
PS: I know this is highly inefficient, but I couldn't think of a better way to compute the cosine similarity between each child vector and the parent vector faster and get the top-N values.
Any help would be highly appreciated.
Answer 1:
Even though your (500000, 100) array (the parent and its children) fits into memory, any pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the distance between every pair of rows. To store these distances you would need a (500000, 500000) array of floats, which, if my calculations are right, would take about 2 TB of memory.
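A quick back-of-the-envelope check of that figure:

num_vectors = 500_000
print(num_vectors * num_vectors * 8 / 1e12)   # one float64 per pair -> ~2.0 TB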
Thankfully there is an easy solution to your problem. If I understand you correctly, you only want the distance between each child and the parent, which results in a vector of length 500000 that is easily stored in memory.
To do this, you simply need to provide a second argument to cosine_similarity containing only the parent_vector:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.rand(500000,100))
# Here I assume that the parent vector is stored as the first row in the dataframe, but you could also store it separately
df['distances'] = cosine_similarity(df, df.iloc[0:1]).flatten()
n = 10 # or however many you want
n_largest = df['distances'].nlargest(n + 1) # this contains the parent itself as the most similar entry, hence n+1 to get n children
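If you keep the child names as the dataframe index instead of the default integer index, the same nlargest call hands you names and scores together. A minimal sketch, assuming hypothetical names of the form child_vector_<i>:

# hypothetical row names; swap in your real identifiers
df.index = ['parent_vector'] + [f'child_vector_{i}' for i in range(1, len(df))]

# drop the parent itself and read off the top-n (name, score) pairs
top_children = df['distances'].drop('parent_vector').nlargest(n)
print(list(top_children.items()))   # e.g. [('child_vector_123', 0.99...), ...]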
Hope that solves your problem.
Answer 2:
This solution is insanely fast
import numpy as np

# stack the children into one (500000, 100) array; make the parent 2-D with shape (1, 100)
child_vectors = np.array([child_vector_1, child_vector_2, ....., child_vector_500000])
parent_vector = np.asarray(parent_vector).reshape(1, -1)
# normalize rows to unit length so a plain dot product gives the cosine similarity
input_norm = parent_vector / np.linalg.norm(parent_vector, axis=-1, keepdims=True)
embed_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1, keepdims=True)
cosine_similarities = np.sort(np.round(np.dot(input_norm, embed_norm.T), 3)[0])[::-1]
pairwise_distances = 1 - cosine_similarities
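Note that np.sort discards which child each score came from. If you also need the top-N indices (and therefore the names), a small sketch along these lines should work, reusing input_norm and embed_norm from above (the child_vector_<i> naming is an assumption):

sims = np.dot(input_norm, embed_norm.T)[0]            # similarity of the parent to every child, shape (500000,)

n = 10
top_idx = np.argpartition(sims, -n)[-n:]              # indices of the n largest scores, unordered
top_idx = top_idx[np.argsort(sims[top_idx])[::-1]]    # order those n by score, descending

top_scores = sims[top_idx]
top_names = [f'child_vector_{i + 1}' for i in top_idx]   # assumes 1-based naming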
Source: https://stackoverflow.com/questions/53875473/cosine-similarity-for-very-large-dataset