doc2vec

Document similarity: Vector embedding versus Tf-Idf performance?

允我心安 submitted on 2020-04-09 18:37:25
Question: I have a collection of documents, where each document grows rapidly over time. The task is to find similar documents at any fixed point in time. I have two potential approaches:

1. A vector embedding (word2vec, GloVe, or fastText), averaging over the word vectors in a document and using cosine similarity.
2. Bag-of-words: tf-idf or a variation of it such as BM25.

Will one of these yield a significantly better result? Has someone done a quantitative comparison of tf-idf versus averaged word2vec for document similarity?
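
To make the comparison concrete, here is a minimal sketch of both candidate pipelines side by side, assuming a list docs of pre-tokenized documents and a trained gensim word2vec model w2v (both hypothetical names, not from the question):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def avg_vector(tokens, w2v):
        # Average the vectors of in-vocabulary words; zero vector if none match.
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    # Embedding route: cosine similarity between averaged word vectors.
    emb = np.vstack([avg_vector(d, w2v) for d in docs])
    emb_sim = cosine_similarity(emb)

    # Bag-of-words route: tf-idf vectors compared the same way.
    tfidf = TfidfVectorizer().fit_transform(" ".join(d) for d in docs)
    tfidf_sim = cosine_similarity(tfidf)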

How to use the infer_vector in gensim.doc2vec?

若如初见. submitted on 2020-01-29 05:29:13
Question:

    def cosine(vector1, vector2):
        cosV12 = np.dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2))
        return cosV12

    model = gensim.models.doc2vec.Doc2Vec.load('Model_D2V_Game')
    string = '民生 为了 父亲 我 要 坚强 地 ...'
    list = string.split(' ')
    vector1 = model.infer_vector(doc_words=list, alpha=0.1, min_alpha=0.0001, steps=5)
    vector2 = model.docvecs.doctag_syn0[0]
    print(cosine(vector2, vector1))
    # -0.0232586

I used training data to train a doc2vec model. Then I used infer_vector() to generate a vector given a
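
infer_vector() is randomized, and five passes is usually far too few for a stable result, so a near-zero or negative cosine similarity against the trained vector is not surprising. A minimal sketch of inferring with more passes, assuming the loaded model above (in gensim 3.x the argument is steps; gensim 4.x renames it to epochs):

    # Re-infer with many more passes and a lower starting learning rate.
    words = '民生 为了 父亲 我 要 坚强 地'.split(' ')
    vector1 = model.infer_vector(words, alpha=0.025, steps=100)
    # Inference is stochastic: repeated calls return slightly different
    # vectors, and more passes reduce that jitter.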

Doc2vec output data for only a single document and not two documents vectors

落花浮王杯 submitted on 2020-01-24 16:26:10
Question: I am trying to build a simple program to test my understanding of Doc2Vec, and it seems I still have a long way to go before I know it well. I understand that each sentence in the document is first labeled with its own label, and that doc2vec learns vectors for these labels. For example, from what I understand, say we have a list of lists with 3 sentences: [["I have a pet"], ["They have a pet"], ["she has no pet"]]. We then break it into 3 sentences: ["I have a pet"] [
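
For reference, a minimal sketch of the usual setup, where each of those three sentences becomes a TaggedDocument carrying its own unique tag, so the model learns three separate document vectors (variable names are illustrative):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    sentences = ["I have a pet", "They have a pet", "she has no pet"]
    # One TaggedDocument per sentence, each with a unique tag.
    corpus = [TaggedDocument(words=s.lower().split(), tags=[i])
              for i, s in enumerate(sentences)]

    model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    print(len(model.dv))  # 3: one vector per tag (model.docvecs in gensim 3.x)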

Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words'

≯℡__Kan透↙ submitted on 2020-01-17 07:49:08
Question: I am learning the Doc2Vec model from the gensim library and using it as follows:

    class MyTaggedDocument(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                    print(fname)
                    for item_no, sentence in enumerate(fin):
                        yield LabeledSentence(
                            [w for w in sentence.lower().split() if w in stopwords.words('english')],
                            [fname.split('.')[0].strip() + '_%s' % item_no])

    sentences =
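
That AttributeError usually means the model was handed plain strings where it expected objects with a .words attribute. A hedged sketch of the modern equivalent, yielding TaggedDocument objects (LabeledSentence is its deprecated alias) so every item the model sees carries .words and .tags:

    import os
    from gensim.models.doc2vec import TaggedDocument

    class MyTaggedDocument(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                    for item_no, sentence in enumerate(fin):
                        # Yield TaggedDocument objects, never bare strings.
                        yield TaggedDocument(
                            words=sentence.lower().split(),
                            tags=['%s_%s' % (fname.split('.')[0].strip(), item_no)])

Note that the original filter, if w in stopwords.words('english'), keeps only the stopwords; 'not in' is presumably what was intended.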

what is the minimum dataset size needed for good performance with doc2vec?

流过昼夜 submitted on 2020-01-02 03:42:09
Question: How does doc2vec perform when trained on datasets of different sizes? There is no mention of dataset size in the original paper, so I am wondering what the minimum size is to get good performance out of doc2vec.

Answer 1: A bunch of things have been called 'doc2vec', but it most often refers to the 'Paragraph Vector' technique from Le and Mikolov. The original 'Paragraph Vector' paper describes evaluating it on three datasets: 'Stanford Sentiment Treebank': 11,825 sentences of movie reviews

Doc2vec

与世无争的帅哥 submitted on 2019-12-31 02:09:12
Contents

1: Background
2: Basic principles
2.1: PV-DM
2.2: PV-DBOW
2.3: Differences from word2vec
2.4: Inferring vectors for new text
3: Code in practice
3.1: API overview
3.2: Main code

1: Background

An earlier post summarized the details of training word vectors with word2vec, explaining how a single word is trained by the word2vec model into a unique vector representation. The natural next question is whether a sentence, or even a short document, can also be represented by a single vector. The answer is yes, and Doc2vec is one of the commonly used algorithms for doing so. Many machine learning algorithms require a fixed-length vector as input, and for short texts the most common fixed-length representation is the bag-of-words model. Despite its popularity, bag-of-words has two major drawbacks. First, it ignores word order: two different sentences composed of the same words in different orders are treated as the same expression. Second, it is blind to word semantics: a model trained on it places 'powerful', 'strong', and 'Paris' at equal distances, when in fact 'powerful' should be closer to 'strong' than to 'Paris'. Doc2Vec, also known as paragraph2vec or sentence embeddings, is an unsupervised algorithm that produces vector representations of sentences/paragraphs/documents and is an extension of word2vec. The learned vectors can be compared by distance to find
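
Since the post contrasts PV-DM and PV-DBOW, here is a minimal gensim sketch of training both variants (the toy corpus is illustrative only):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    texts = [["i", "have", "a", "pet"], ["they", "have", "a", "pet"]]
    corpus = [TaggedDocument(words, tags=[i]) for i, words in enumerate(texts)]

    # dm=1 selects PV-DM: context words plus the paragraph vector predict the target word.
    pv_dm = Doc2Vec(corpus, dm=1, vector_size=100, min_count=1, epochs=40)
    # dm=0 selects PV-DBOW: the paragraph vector alone predicts words sampled from the document.
    pv_dbow = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=40)

    # Section 2.4's inference step: gradient updates against a frozen model.
    new_vec = pv_dm.infer_vector(["she", "has", "no", "pet"])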

Gensim doc2vec file stream training worse performance

大兔子大兔子 submitted on 2019-12-24 18:38:55
Question: Recently I switched to gensim 3.6, and the main reason was the optimized training process, which streams the training data directly from a file, thus avoiding the GIL performance penalty. This is how I used to train my doc2vec:

    training_iterations = 20
    d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
    d2v.build_vocab(corpus)

    for epoch in range(training_iterations):
        d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter)
        d2v.alpha -= 0.0002
        d2v
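
For reference, the file-streaming path added in gensim 3.6 takes a corpus_file argument pointing at a LineSentence-format text file (one whitespace-tokenized document per line, tagged by line number). A hedged sketch, with 'corpus.txt' as a hypothetical file:

    from multiprocessing import cpu_count
    from gensim.models.doc2vec import Doc2Vec

    # A single construction call trains immediately from the file; passing the
    # full epoch count up front lets gensim manage the alpha decay internally,
    # so the manual per-epoch loop above is not needed.
    d2v = Doc2Vec(corpus_file='corpus.txt', vector_size=200, dm=0,
                  epochs=20, workers=cpu_count())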

Doc2vec: model.docvecs is only of length 10

∥☆過路亽.° submitted on 2019-12-22 18:14:59
Question: I am trying doc2vec on 600,000 rows of sentences and my code is below:

    model = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, window=4, iter=50, workers=cores)
    model.build_vocab(res)
    model.train(res, total_examples=model.corpus_count, epochs=model.iter)

    # len(res) = 663406
    print(len(model.wv.vocab))  # number of unique words: 15581
    len(model.docvecs)          # length of doc vectors is 10
    len(model.docvecs[1])       # each of length 100

How do I interpret this result? Why is the length of the vector only
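
One common cause of exactly this symptom (offered as a likely diagnosis, since the question is truncated): if each document's tags field is a plain string rather than a list, gensim iterates over it character by character, so numeric string tags collapse into the ten distinct digit characters '0' through '9', leaving exactly ten doc vectors. A sketch of the difference:

    from gensim.models.doc2vec import TaggedDocument

    # Pitfall: a bare string tag is iterated per character, so document 663405
    # contributes the tags '6', '6', '3', '4', '0', '5'.
    bad = TaggedDocument(words=["some", "tokens"], tags=str(663405))

    # Fix: wrap the tag in a list so each document keeps one unique tag.
    good = TaggedDocument(words=["some", "tokens"], tags=[663405])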

What are doc2vec training iterations?

泄露秘密 submitted on 2019-12-22 10:29:50
Question: I am new to doc2vec. I was initially trying to understand doc2vec, and below is my code, which uses Gensim. As intended, I get a trained model and document vectors for the two documents. However, I would like to know the benefits of retraining the model over several epochs and how to do that in Gensim. Can we do it using the iter or alpha parameter, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs. Also,
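
The usual pattern (a sketch under the assumption that the questioner's training data is an iterable of TaggedDocuments named corpus): pass the full epoch count once and let a single train() call manage the learning-rate schedule, rather than looping over train() manually:

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
    model.build_vocab(corpus)
    # One call covers all 20 passes; gensim decays alpha internally.
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)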