Question
I’m working on a sentence similarity algorithm with the following use case: given a new sentence, I want to retrieve its n most similar sentences from a given set. I am using Gensim v.3.7.1, and I have trained both word2vec and doc2vec models. The results of the latter outperform word2vec’s, but I’m having trouble performing efficient queries with my Doc2Vec model. This model uses the distributed bag of words implementation (dm = 0).
I used to infer similarity with the built-in method model.most_similar(), but this stopped being possible once I started training on more data than the subset I want to query against. That is to say, I want to find the most similar sentence within a subset of my training dataset. My quick fix was to compare the vector of the new sentence against every vector in my set using cosine similarity, but obviously this does not scale, as I have to compute loads of embeddings and make a lot of comparisons.
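For reference, the brute-force approach described above can be sketched with plain NumPy. The names set_vectors (the doc-vectors of the subset) and new_vector (the inferred vector of the query sentence) are placeholders, not identifiers from my code:

```python
import numpy as np

def top_n_cosine(new_vector, set_vectors, n=10):
    """Return indices of the n rows of set_vectors most similar to new_vector."""
    # Cosine similarity = dot product divided by the product of norms.
    norms = np.linalg.norm(set_vectors, axis=1) * np.linalg.norm(new_vector)
    sims = set_vectors @ new_vector / norms
    # Sort descending by similarity and keep the top n indices.
    return np.argsort(sims)[::-1][:n]

# Toy example: three 2-d "doc vectors" and one query vector.
set_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
new_vector = np.array([1.0, 0.1])
print(top_n_cosine(new_vector, set_vectors, n=2))
```

This is O(len(set_vectors)) per query, which is exactly why it doesn't scale for me.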
I have successfully used Word Mover's Distance (WMD) with both word2vec and doc2vec, but I get better results for doc2vec with cosine similarity. How can I efficiently query a new document against my set using the PV-DBOW Doc2Vec model and one of the Similarity classes?
I'm looking for a similar approach to what I do with WMD, but for doc2vec cosine similarity:
import gensim

# set_to_query contains ~10% of the training data + some future updates
set_to_query_tokenized = [sentence.split() for sentence in set_to_query]
w2v_model = gensim.models.Word2Vec.load("my_w2v_model")
w2v_to_query = gensim.similarities.WmdSimilarity(
    corpus=set_to_query_tokenized,
    w2v_model=w2v_model,
    num_best=10,
)
new_query = "I want to find the most similar sentence to this one".split()
most_similar = w2v_to_query[new_query]
Answer 1:
Creating your own subset of vectors, as a KeyedVectors instance, isn't quite as easy as it could or should be. But you should be able to use a WordEmbeddingsKeyedVectors (even though you're working with doc-vectors) that you load with just the vectors of interest. I haven't tested this, but assuming d2v_model is your Doc2Vec model, and list_of_tags are the tags you want in your subset, try something like:
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

subset_vectors = WordEmbeddingsKeyedVectors(d2v_model.vector_size)
subset_vectors.add(list_of_tags, d2v_model.docvecs[list_of_tags])
Then you can perform the usual operations, like most_similar(), on subset_vectors.
Source: https://stackoverflow.com/questions/56130065/how-to-perform-efficient-queries-with-gensim-doc2vec