Question
I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits perfectly in memory, but I get the MemoryError during the internal np.dot() call.
Here's my use-case and how I am currently tackling it.
Here's my parent vector of 100 dimensions, which I need to compare with 500,000 other vectors of the same dimension (i.e. 100):
parent_vector = [1, 2, 3, 4 ..., 100]
Here are my child vectors (with some made-up random numbers for this example)
child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]
My final goal is to get the top-N child vectors (with their names, such as child_vector_1, and their corresponding cosine scores) that have the highest cosine similarity with the parent vector.
My current approach (which I know is inefficient and memory-consuming):
Step 1: Create a super-dataframe of the following shape:
parent_vector 1, 2, 3, ....., 100
child_vector_1 2, 3, 4, ....., 101
child_vector_2 3, 4, 5, ....., 102
child_vector_3 4, 5, 6, ....., 103
......................................
child_vector_500000 3, 4, 5, ....., 103
Step 2: Use
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
to get the pairwise cosine similarity between all vectors (shown in the above dataframe).
Step 3: Make a list of tuples to store the key (such as child_vector_1) and the value (the corresponding cosine similarity score) for all such combinations.
Step 4: Get the top-N using sort() on the list, so that I get the child vector name as well as its cosine similarity score with the parent vector.
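In code, my current approach looks roughly like this (a simplified sketch; the vectors dict is just a stand-in for however the data is actually loaded):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# vectors: dict mapping names ("parent_vector", "child_vector_1", ...) to 100-d lists
df = pd.DataFrame.from_dict(vectors, orient='index')    # Step 1: super-dataframe

sims = cosine_similarity(df)                             # Step 2: full pairwise matrix

parent_idx = df.index.get_loc('parent_vector')
pairs = [(name, sims[parent_idx, i])                     # Step 3: (name, score) tuples
         for i, name in enumerate(df.index) if name != 'parent_vector']

pairs.sort(key=lambda t: t[1], reverse=True)             # Step 4: sort and take top-N
top_n = pairs[:10]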
PS: I know this is highly inefficient, but I couldn't think of a better way to compute the cosine similarity between each child vector and the parent vector faster and get the top-N values.
Any help would be highly appreciated.
Answer 1:
Even though your (500000, 100) array (the parent and its children) fits into memory, any pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the distance between every pair of rows. To store these distances you would need a (500000, 500000) array of floats, which, if my calculations are right, would take about 2 TB of memory.
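A quick back-of-the-envelope check of that figure:

num_vectors = 500_000
print(num_vectors * num_vectors * 8 / 1e12)   # one float64 per pair -> ~2.0 TB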
Thankfully there is an easy solution to your problem. If I understand you correctly, you only want the distance between each child and the parent, which results in a vector of length 500000 that is easily stored in memory.
To do this, you simply need to provide a second argument to cosine_similarity containing only the parent_vector:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.rand(500000,100))
# Here I assume that the parent vector is stored as the first row in the dataframe, but you could also store it separately
df['distances'] = cosine_similarity(df, df.iloc[0:1]).flatten()
n = 10 # or however many you want
n_largest = df['distances'].nlargest(n + 1) # this contains the parent itself as the most similar entry, hence n+1 to get n children
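If you keep the child names as the dataframe index instead of the default integer index, the same nlargest call hands you names and scores together. A minimal sketch, assuming hypothetical names of the form child_vector_<i>:

# hypothetical row names; swap in your real identifiers
df.index = ['parent_vector'] + [f'child_vector_{i}' for i in range(1, len(df))]

# drop the parent itself and read off the top-n (name, score) pairs
top_children = df['distances'].drop('parent_vector').nlargest(n)
print(list(top_children.items()))   # e.g. [('child_vector_123', 0.99...), ...]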
Hope that solves your problem.
Answer 2:
This solution is insanely fast
import numpy as np

# stack the children into one (500000, 100) array; make the parent 2-D with shape (1, 100)
child_vectors = np.array([child_vector_1, child_vector_2, ....., child_vector_500000])
parent_vector = np.asarray(parent_vector).reshape(1, -1)
# normalize rows to unit length so a plain dot product gives the cosine similarity
input_norm = parent_vector / np.linalg.norm(parent_vector, axis=-1, keepdims=True)
embed_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1, keepdims=True)
cosine_similarities = np.sort(np.round(np.dot(input_norm, embed_norm.T), 3)[0])[::-1]
pairwise_distances = 1 - cosine_similarities
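Note that np.sort discards which child each score came from. If you also need the top-N indices (and therefore the names), a small sketch along these lines should work, reusing input_norm and embed_norm from above (the child_vector_<i> naming is an assumption):

sims = np.dot(input_norm, embed_norm.T)[0]            # similarity of the parent to every child, shape (500000,)

n = 10
top_idx = np.argpartition(sims, -n)[-n:]              # indices of the n largest scores, unordered
top_idx = top_idx[np.argsort(sims[top_idx])[::-1]]    # order those n by score, descending

top_scores = sims[top_idx]
top_names = [f'child_vector_{i + 1}' for i in top_idx]   # assumes 1-based naming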
Source: https://stackoverflow.com/questions/53875473/cosine-similarity-for-very-large-dataset