According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words.
e.g.
trained_model.simi
Facebook Research group released a new solution called InferSent Results and code are published on Github, check their repo. It is pretty awesome. I am planning to use it. https://github.com/facebookresearch/InferSent
their paper https://arxiv.org/abs/1705.02364 Abstract: Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.
You can just add the word vectors of one sentence together. Then count the Cosine similarity of two sentence vector as the similarity of two sentence. I think that's the most easy way.
This is actually a pretty challenging problem that you are asking. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. "he walked to the store yesterday" and "yesterday, he walked to the store"), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurences / relationships in lots of real textual examples, etc.
The simplest thing you could try -- though I don't know how well this would perform and it would certainly not give you the optimal results -- would be to first remove all "stop" words (words like "the", "an", etc. that don't add much meaning to the sentence) and then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you'll at least not be subject to word order. That being said, this will fail in lots of ways and isn't a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).
So, short answer is, no, there's no easy way to do this (at least not to do it well).
If you are using word2vec, you need to calculate the average vector for all words in every sentence/document and use cosine similarity between vectors:
import numpy as np
from scipy import spatial
index2word_set = set(model.wv.index2word)
def avg_feature_vector(sentence, model, num_features, index2word_set):
words = sentence.split()
feature_vec = np.zeros((num_features, ), dtype='float32')
n_words = 0
for word in words:
if word in index2word_set:
n_words += 1
feature_vec = np.add(feature_vec, model[word])
if (n_words > 0):
feature_vec = np.divide(feature_vec, n_words)
return feature_vec
Calculate similarity:
s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)
> 0.915479828613
There is a function from the documentation taking a list of words and comparing their similarities.
s1 = 'This room is dirty'
s2 = 'dirty and disgusting room' #corrected variable name
distance = model.wv.n_similarity(s1.lower().split(), s2.lower().split())
Once you compute the sum of the two sets of word vectors, you should take the cosine between the vectors, not the diff. The cosine can be computed by taking the dot product of the two vectors normalized. Thus, the word count is not a factor.