Question
I am running doc2vec on about 600,000 rows of sentences; my code is below:

model = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, window=4, iter=50, workers=cores)
model.build_vocab(res)
model.train(res, total_examples=model.corpus_count, epochs=model.iter)

# len(res) = 663406
# number of unique words: 15581
print(len(model.wv.vocab))
# number of doc vectors is 10
len(model.docvecs)
# each of length 100
len(model.docvecs[1])
How do I interpret this result? Why are there only 10 document vectors, each of size 100, when the length of res is 663406? That does not make sense; I know something is wrong here.
In Understanding the output of Doc2Vec from Gensim package, they mention that the length of each docvec is determined by 'size', which is not clear to me.
Answer 1:
The tags of a TaggedDocument should be a list-of-tags. If you instead provided strings, like tags='73215', that would be seen as the same as the list-of-characters tags=['7', '3', '2', '1', '5']. At the end, you'd only have 10 tags in your whole training set, just the 10 digits in various combinations.

That your len(model.docvecs) is 10 suggests you made exactly this error, or something similar, in constructing your TaggedDocument training data.
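The collapse can be reproduced without gensim at all: iterating over a Python string yields its characters, so a string tag behaves like a list of single-character tags. A minimal sketch (the numeric document IDs here are made up for illustration):

```python
# A string tag like '73215' is iterated character-by-character,
# so it degenerates into five single-character tags.
wrong_tag = '73215'          # intended as ONE tag for ONE document
print(list(wrong_tag))       # ['7', '3', '2', '1', '5']

# Across any number of numeric document IDs, the set of distinct
# "tags" can never exceed the 10 digits '0'..'9'.
doc_ids = [str(i) for i in range(663406)]    # 663406 documents
distinct_tags = set()
for tag_string in doc_ids:
    distinct_tags.update(tag_string)         # adds characters, not whole IDs
print(len(distinct_tags))                    # 10
```

This is why the model ends up with exactly 10 doc-vectors no matter how large the corpus is.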
Look at the first item in res, to see if its tags property makes sense, and at each of the model.docvecs, to see what's being used instead of what you intended.
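A sketch of the fix, using a plain namedtuple as a stand-in for gensim's TaggedDocument (which is itself a words/tags namedtuple): wrap each document's tag in a list, then sanity-check the tag count before training. The sample sentences are made up for illustration:

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (a words/tags namedtuple).
TaggedDocument = namedtuple('TaggedDocument', ['words', 'tags'])

sentences = [['some', 'tokens'], ['more', 'tokens'], ['final', 'tokens']]

# Correct: tags is a LIST holding this document's single tag.
res = [TaggedDocument(words=words, tags=[str(i)])
       for i, words in enumerate(sentences)]

# Sanity check before training: every tags value should be a list,
# and the number of distinct tags should match the corpus size.
assert all(isinstance(doc.tags, list) for doc in res)
unique_tags = {tag for doc in res for tag in doc.tags}
print(len(unique_tags) == len(res))   # True
```

With one list-wrapped tag per document, len(model.docvecs) after training should equal len(res) rather than 10.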
Source: https://stackoverflow.com/questions/47929028/doc2vec-model-docvecs-is-only-of-length-10