Doc2Vec Get most similar documents

-上瘾入骨i 2021-01-30 04:00

I am trying to build a document retrieval model that returns the most relevant documents, ordered by their relevance to a query or search string. For this I trained a doc2vec m

1 answer
  •  轻奢々 (OP)
    2021-01-30 04:46

    Use infer_vector to get a document vector for the new text. This call does not alter the underlying model.

    Here is how you do it:

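    # Tokenize the query the same way the training documents were tokenized
    tokens = "a new sentence to match".split()

    # Infer a vector for the unseen text; this does not modify the model
    new_vector = model.infer_vector(tokens)
    # Top 10 (tag, cosine similarity) pairs; in gensim 4+ use model.dv instead of model.docvecs
    sims = model.docvecs.most_similar([new_vector])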
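
To make the ranking step concrete, here is a minimal pure-Python sketch of the idea behind that lookup: score every stored document vector by cosine similarity to the inferred vector and keep the top results. It is not gensim's implementation. The function names `cosine` and `most_similar`, the tags, and the 3-dimensional vectors are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(query_vec, doc_vectors, topn=10):
    """Return up to topn (tag, similarity) pairs, best match first."""
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical document vectors; a real model would have many more
# dimensions and one vector per training document.
doc_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}

# Stand-in for the output of infer_vector on the query text.
query_vec = [2.0, 0.1, 0.0]

ranked = most_similar(query_vec, doc_vectors, topn=2)
print(ranked)
```

Because cosine similarity ignores vector length, the query ranks closest to doc_a even though its magnitude differs; only the direction of the vectors matters.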
    

    Edit:

    Here is an example showing that the underlying model does not change after infer_vector is called.

    import numpy as np

    words = "king queen man".split()

    len_before = len(model.docvecs)  # number of docs (model.dv in gensim 4+)

    # Snapshot the word vectors for king, queen, man. The copy matters:
    # without it we might compare an array with a view of itself.
    # model.wv[...] replaces the deprecated model[...] lookup.
    w_vec0 = model.wv[words[0]].copy()
    w_vec1 = model.wv[words[1]].copy()
    w_vec2 = model.wv[words[2]].copy()

    new_vec = model.infer_vector(words)

    len_after = len(model.docvecs)

    print(np.array_equal(model.wv[words[0]], w_vec0))  # True
    print(np.array_equal(model.wv[words[1]], w_vec1))  # True
    print(np.array_equal(model.wv[words[2]], w_vec2))  # True

    print(len_before == len_after)  # True
    
