How to calculate the sentence similarity using word2vec model of gensim with python

后端未结

关注

 14  1216

According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words.

e.g.

trained_model.simi


                      
              相关标签:


      
      
        
          14条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  失恋的感觉        
                
              
                            
                2020-11-28 00:57
              
            
            
                                                                       
Since you're using gensim, you should probably use it's doc2vec implementation. doc2vec is an extension of word2vec to the phrase-, sentence-, and document-level. It's a pretty simple extension, described here

http://cs.stanford.edu/~quocle/paragraph_vector.pdf

Gensim is nice because it's intuitive, fast, and flexible. What's great is that you can grab the pretrained word embeddings from the official word2vec page and the syn0 layer of gensim's Doc2Vec model is exposed so that you can seed the word embeddings with these high quality vectors!

GoogleNews-vectors-negative300.bin.gz (as linked in Google Code)

I think gensim is definitely the easiest (and so far for me, the best) tool for embedding a sentence in a vector space.

There exist other sentence-to-vector techniques than the one proposed in Le & Mikolov's paper above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this area. Their work has been based on the principle of compositionally - semantics of the sentence come from:

1. semantics of the words

2. rules for how these words interact and combine into phrases


They've proposed a few such models (getting increasingly more complex) for how to use compositionality to build sentence-level representations.

2011 - unfolding recursive autoencoder (very comparatively simple. start here if interested)

2012 - matrix-vector neural network

2013 - neural tensor network

2015 - Tree LSTM

his papers are all available at socher.org. Some of these models are available, but I'd still recommend gensim's doc2vec. For one, the 2011 URAE isn't particularly powerful. In addition, it comes pretrained with weights suited for paraphrasing news-y data. The code he provides does not allow you to retrain the network. You also can't swap in different word vectors, so you're stuck with 2011's pre-word2vec embeddings from Turian. These vectors are certainly not on the level of word2vec's or GloVe's.

Haven't worked with the Tree LSTM yet, but it seems very promising!

tl;dr Yeah, use gensim's doc2vec. But other methods do exist!
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情深已故        
                
              
                            
                2020-11-28 00:58
              
            
            
                                                                       
I am using the following method and it works well.
You first need to run a POSTagger and then filter your sentence to get rid of the stop words (determinants, conjunctions, ...). I recommend TextBlob APTagger.
Then you build a word2vec by taking the mean of each word vector in the sentence. The n_similarity method in Gemsim word2vec does exactly that by allowing to pass two sets of words to compare.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  小鲜肉        
                
              
                            
                2020-11-28 00:58
              
            
            
                                                                       
Gensim implements a model called Doc2Vec for paragraph embedding.

There are different tutorials presented as IPython notebooks:


Doc2Vec Tutorial on the Lee Dataset 
Gensim Doc2Vec Tutorial on the IMDB Sentiment Dataset
Doc2Vec to wikipedia articles


Another method would rely on Word2Vec and Word Mover's Distance (WMD), as shown in this tutorial:


Finding similar documents with Word2Vec and WMD 


An alternative solution would be to rely on average vectors:

from gensim.models import KeyedVectors
from gensim.utils import simple_preprocess    

def tidy_sentence(sentence, vocabulary):
    return [word for word in simple_preprocess(sentence) if word in vocabulary]    

def compute_sentence_similarity(sentence_1, sentence_2, model_wv):
    vocabulary = set(model_wv.index2word)    
    tokens_1 = tidy_sentence(sentence_1, vocabulary)    
    tokens_2 = tidy_sentence(sentence_2, vocabulary)    
    return model_wv.n_similarity(tokens_1, tokens_2)

wv = KeyedVectors.load('model.wv', mmap='r')
sim = compute_sentence_similarity('this is a sentence', 'this is also a sentence', wv)
print(sim)


Finally, if you can run Tensorflow, you may try:
https://tfhub.dev/google/universal-sentence-encoder/2
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一个人的身影        
                
              
                            
                2020-11-28 01:00
              
            
            
                                                                       
There are extensions of Word2Vec intended to solve the problem of comparing longer pieces of text like phrases or sentences. One of them is paragraph2vec or doc2vec.

"Distributed Representations of Sentences and Documents"
http://cs.stanford.edu/~quocle/paragraph_vector.pdf

http://rare-technologies.com/doc2vec-tutorial/
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  醉酒成梦        
                
              
                            
                2020-11-28 01:00
              
            
            
                                                                       
If not using Word2Vec we have other model to find it using BERT for embed.
Below are reference link 
https://github.com/UKPLab/sentence-transformers

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import scipy.spatial

embedder = SentenceTransformer('bert-base-nli-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating a food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']
query_embeddings = embedder.encode(queries)

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))


Other Link to follow
https://github.com/hanxiao/bert-as-service
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  天命终不由人        
                
              
                            
                2020-11-28 01:03
              
            
            
                                                                       
I would like to update the existing solution to help the people who are going to calculate the semantic similarity of sentences.

Step 1:

Load the suitable model using gensim and calculate the word vectors for words in the sentence and store them as a word list

Step 2 :
Computing the sentence vector

The calculation of semantic similarity between sentences was difficult before but recently a paper named "A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE
EMBEDDINGS" was proposed which suggests a simple approach by computing the weighted average of word vectors in the sentence and then remove the projections of the average vectors on their first principal component.Here the weight of a word w is a/(a + p(w)) with a being a parameter and p(w) the (estimated) word frequency called smooth inverse frequency.this method performing significantly better.

A simple code to calculate the sentence vector using SIF(smooth inverse frequency) the method proposed in the paper has been given here

Step 3:
using sklearn cosine_similarity load two vectors for the sentences and compute the similarity.

This is the most simple and efficient method to compute the sentence similarity.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     1
2
3
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复