gensim word2vec: Find number of words in vocabulary

鱼传尺愫 2021-01-31 07:46

After training a word2vec model with Python's gensim, how do you find the number of words in the model's vocabulary?

2 answers
  • 2021-01-31 08:48

    Another way to get the vocabulary size is to read it from the embedding matrix itself:

    In [33]: from gensim.models import Word2Vec
    
    # load the pretrained model (`pretrained_model` holds the path to a saved model)
    In [34]: model = Word2Vec.load(pretrained_model)
    
    # get the shape of embedding matrix    
    In [35]: model.wv.vectors.shape
    Out[35]: (662109, 300)
    
    # `vocabulary_size` is just the number of rows (i.e. axis 0)
    In [36]: model.wv.vectors.shape[0]
    Out[36]: 662109
    
  • 2021-01-31 08:49

    The vocabulary is stored in the vocab field of the Word2Vec model's wv property, as a dictionary whose keys are the tokens (words). So the vocabulary size is just the length of that dictionary:

    len(w2v_model.wv.vocab)
    

    (In older gensim versions before 0.13, vocab appeared directly on the model. So you would use w2v_model.vocab instead of w2v_model.wv.vocab.)
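
The lookups above depend on the gensim version, so a version-tolerant helper may be useful. The following is a minimal sketch, not gensim's own API: the helper name `vocab_size` and the stand-in objects are hypothetical, chosen so the example runs without a trained model. It relies on gensim 4.0 and later exposing the token-to-index dictionary as `wv.key_to_index` in place of `wv.vocab`.

```python
from types import SimpleNamespace


def vocab_size(model):
    """Return the number of words known to a Word2Vec-like model (hypothetical helper)."""
    # Full models keep their word vectors under `.wv`; very old gensim
    # versions kept the vocabulary directly on the model itself.
    kv = getattr(model, "wv", model)
    # gensim 4.0+ exposes a token -> index dictionary named `key_to_index`.
    key_to_index = getattr(kv, "key_to_index", None)
    if key_to_index is not None:
        return len(key_to_index)
    # Earlier versions expose a token -> entry dictionary named `vocab`.
    return len(kv.vocab)


# Stand-in objects that only mimic the attribute layout of each gensim
# generation; they are not real gensim models.
new_style = SimpleNamespace(wv=SimpleNamespace(key_to_index={"cat": 0, "dog": 1, "fish": 2}))
old_style = SimpleNamespace(wv=SimpleNamespace(vocab={"cat": object(), "dog": object()}))
very_old_style = SimpleNamespace(vocab={"cat": object()})

print(vocab_size(new_style))       # 3
print(vocab_size(old_style))       # 2
print(vocab_size(very_old_style))  # 1
```

With a real model, the call would simply be `vocab_size(w2v_model)`. Checking `key_to_index` first matters because newer releases no longer support reading `wv.vocab`.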
