Understanding the output of Doc2Vec from Gensim package

I have some sample sentences that I want to run through a Doc2Vec model. My end goal is a matrix of size (num_sentences, num_features).

I'm using the Gensim package…

2 Answers

夕颜

2021-01-05 10:05
TaggedDocument expects tags to be a list of tags for the document, not a single string.

In your case,
```
sentence = TaggedDocument(words=['a', 'b'], tags='400')
```
is interpreted as a sentence with three tags, ['4', '0', '0'], because a string is iterated one character at a time. That is why model.docvecs returns vectors for only 10 tags, one per digit: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'].

Try changing this to
```
sentence = TaggedDocument(words=['a', 'b'], tags=['400'])
```
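To see why the bare string misbehaves, here is a minimal sketch. It assumes a namedtuple stand-in with the same two fields as Gensim's TaggedDocument, so it runs without Gensim installed, and as_tag_list is a hypothetical helper, not part of Gensim.

```python
from collections import namedtuple

# Stand-in with the same two fields as Gensim's TaggedDocument (assumption:
# used here only so the snippet runs without Gensim installed).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def as_tag_list(tags):
    """Hypothetical helper: wrap a lone str or int tag in a list."""
    if isinstance(tags, (str, int)):
        return [tags]
    return list(tags)


bad = TaggedDocument(words=["a", "b"], tags="400")
good = TaggedDocument(words=["a", "b"], tags=as_tag_list("400"))

# Iterating over a str yields its characters, so "400" acts like three tags.
print(list(bad.tags))   # ['4', '0', '0']
print(list(good.tags))  # ['400']
```

Wrapping the tag once, at the point where the documents are built, keeps each sentence under a single tag of its own.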
别那么骄傲

2021-01-05 10:14
model.docvecs holds one vector per distinct tag, so its length equals the number of documents only when every document carries its own unique tag. Each docvec is the vector representation of a single tagged document. Its length is set by the size parameter you passed when training the model (Gensim 4.0 renamed this parameter to vector_size and exposes the vectors as model.dv). size is commonly between 100 and 300, and sometimes larger. A vector of length 10 would do a poor job of representing the documents you fed it.

Thus, something like this would be more productive:
```
import gensim as gn

# One TaggedDocument per token list, tagged with its integer index
docs = [gn.models.doc2vec.TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(lot)]
```
Where lot is a list of lists of tokens (words) like this:
```
lot = [['the','cat','sat'],['the','dog','ran']]
```
Running the model:
```
# Gensim >= 4.0 calls this parameter vector_size (older releases: size)
model = gn.models.doc2vec.Doc2Vec(docs, vector_size=300, window=8, dm=1, hs=1, alpha=.025, min_alpha=.0001)
```
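Once every document has its own integer tag 0..n-1, the (num_sentences, num_features) matrix from the question is just the per-tag vectors stacked in tag order. A minimal sketch of that step follows. It assumes a plain dict as a stand-in for the trained model's vector lookup, so it runs without Gensim, and doc_vector_matrix is a hypothetical helper.

```python
def doc_vector_matrix(lookup, num_docs):
    """Hypothetical helper: stack vectors for tags 0..num_docs-1 into rows."""
    rows = [list(lookup[i]) for i in range(num_docs)]
    # Every row must have the same number of features to form a matrix.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("document vectors have different lengths")
    return rows


# Toy stand-in for a trained model's lookup: 2 documents, 3 features each.
fake_vectors = {0: [0.1, 0.2, 0.3], 1: [0.4, 0.5, 0.6]}

matrix = doc_vector_matrix(fake_vectors, num_docs=2)
print(len(matrix), len(matrix[0]))  # 2 3
```

With a real model, the lookup would be model.dv in Gensim 4.x (model.docvecs in older releases), indexed by the same integer tags, and the rows can be passed to numpy.asarray if an array is needed.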