doc2vec

gensim - Doc2Vec: MemoryError when training on english Wikipedia

你说的曾经没有我的故事 · Submitted on 2019-12-20 04:50:23
Question: I extracted 145,185,965 sentences (14 GB) from the English Wikipedia dump and want to train a Doc2Vec model on these sentences. Unfortunately I have 'only' 32 GB of RAM and get a MemoryError when trying to train. Even if I set min_count to 50, gensim tells me that it would need over 150 GB of RAM. I don't think that further increasing min_count would be a good idea, because the resulting model would not be very good (just a guess). But anyway, I will try it with 500 to see if…
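The ~150 GB figure is dominated by the per-document (doctag) vectors, which min_count does not shrink at all. A rough back-of-envelope sketch, assuming one tag per sentence, float32 vectors, and a commonly chosen vector_size of 300 (not gensim's exact overhead):

```python
# Rough memory estimate for the doctag vector array alone.
# Assumptions: one tag per sentence, float32, vector_size=300.
n_docs = 145_185_965
vector_size = 300
bytes_per_float = 4

doctag_bytes = n_docs * vector_size * bytes_per_float
print(f"doctag vectors alone: {doctag_bytes / 1024**3:.0f} GiB")  # ~162 GiB
```

Under these assumptions the doctag array alone already exceeds 150 GB. Raising min_count only shrinks the word-vector and output-layer arrays; to actually fit in 32 GB of RAM one would have to reduce the number of document tags (e.g. tag articles or paragraphs instead of individual sentences) or lower vector_size.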

gensim doc2vec “intersect_word2vec_format” command

大城市里の小女人 · Submitted on 2019-12-20 03:55:16
Question: I was just reading through the doc2vec commands on the gensim page and am curious about the command "intersect_word2vec_format". My understanding of this command is that it lets me inject vector values from a pretrained word2vec model into my doc2vec model, and then train my doc2vec model using the pretrained word2vec values rather than generating the word vector values from my document corpus. The result is that I get a more accurate doc2vec model, because I am using pretrained w2v values which were…

How are word vectors co-trained with paragraph vectors in doc2vec DBOW?

。_饼干妹妹 · Submitted on 2019-12-13 19:29:02
Question: I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that it's disabled by default with dbow_words=0. But what happens when we set dbow_words to 1? In my understanding of DBOW, the context words are predicted directly from the paragraph vectors, so the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier. But multiple sources hint that it is possible in DBOW…

How to perform efficient queries with Gensim doc2vec?

て烟熏妆下的殇ゞ · Submitted on 2019-12-12 16:33:15
Question: I'm working on a sentence similarity algorithm with the following use case: given a new sentence, I want to retrieve its n most similar sentences from a given set. I am using Gensim v3.7.1 and have trained both word2vec and doc2vec models. The results of the latter outperform word2vec's, but I'm having trouble performing efficient queries with my Doc2Vec model. This model uses the distributed bag of words implementation (dm=0). I used to infer similarity using the built-in method model…

Issues in doc2vec tags in Gensim

大城市里の小女人 · Submitted on 2019-12-12 04:46:56
Question: I am using gensim doc2vec as below.

from gensim.models import doc2vec
from collections import namedtuple
import re

my_d = {'recipe__001__1': 'recipe 1 details should come here',
        'recipe__001__2': 'Ingredients of recipe 2 need to be added'}
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for key, value in my_d.items():
    value = re.sub("[^a-zA-Z]", " ", value)
    words = value.lower().split()
    tags = key
    docs.append(analyzedDocument(words, tags))
model = doc2vec.Doc2Vec(docs…

How to find most similar terms/words of a document in doc2vec? [duplicate]

爱⌒轻易说出口 · Submitted on 2019-12-12 04:08:49
Question: (This question already has answers here: "How to interpret Clusters results after using Doc2vec?" (3 answers); closed 2 years ago.) I have applied Doc2vec to convert documents into vectors. After that, I used the vectors in clustering and found the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is: is there any way to figure…

GridSearch for doc2vec model built using gensim

时间秒杀一切 · Submitted on 2019-12-11 08:44:02
Question: I am trying to find the best hyperparameters for my trained doc2vec gensim model, which takes a document as input and creates its document embedding. My training data consists of text documents but has no labels, i.e. I just have 'X' but not 'y'. I found some related questions here, but all of the proposed solutions are for supervised models, none for an unsupervised one like mine. Here is the code where I am training my doc2vec model: def train_doc2vec(self, X:…

How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

假如想象 · Submitted on 2019-12-10 15:51:04
Question: I'm trying to get the text with its punctuation, as the latter is important to consider in my doc2vec model. However, WikiCorpus retrieves only the text. After searching the web I found these pages: a page from the gensim GitHub issues section, where the answer to someone's question was to subclass WikiCorpus (answered by Piskvorky) — luckily, the same page contained code implementing the suggested subclass, provided by Rhazegh (link); and a page from…

What does epochs mean in Doc2Vec and train when I have to manually run the iteration?

安稳与你 · Submitted on 2019-12-10 15:49:34
Question: I am trying to understand the epochs parameter of the Doc2Vec constructor and the epochs parameter of the train function. In the following code snippet, I manually set up a loop of 4000 iterations. Is that required, or is passing 4000 as the epochs parameter to Doc2Vec enough? Also, how is epochs in Doc2Vec different from epochs in train?

documents = Documents(train_set)
model = Doc2Vec(vector_size=100, dbow_words=1, dm=0, epochs=4000, window=5,
                seed=1337, min_count=5, workers=4, alpha=0.001, min_alpha=0…

Gensim: how to retrain doc2vec model using previous word2vec model

旧城冷巷雨未停 · Submitted on 2019-12-08 13:54:35
With Doc2Vec modelling, I have trained a model and saved the following files:

1. model
2. model.docvecs.doctag_syn0.npy
3. model.syn0.npy
4. model.syn1.npy
5. model.syn1neg.npy

However, I now have a new way to label the documents and want to train the model again. Since the word vectors were already obtained in the previous run, is there any way to reuse that model (e.g., taking the previous w2v results as initial vectors for training)? Does anyone know how to do it? I've figured out that we can just load the model and continue to train:

model = Doc2Vec.load("old_model")
model.train(sentences)