After training a word2vec model with Python's gensim, how do you find the number of words in the model's vocabulary?
One more way to get the vocabulary size is from the embedding matrix itself as in:
In [33]: from gensim.models import Word2Vec
# load the pretrained model
In [34]: model = Word2Vec.load(pretrained_model)
# get the shape of embedding matrix
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)
# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109
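As a quick sanity check, the same idea works on a freshly trained toy model. This is a minimal sketch, assuming gensim 4.x, where the constructor parameter is vector_size (in 3.x it was size); the corpus here is purely illustrative:

from gensim.models import Word2Vec

# tiny illustrative corpus with 3 unique tokens
sentences = [["hello", "world"], ["hello", "gensim"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

# each row of the embedding matrix is one vocabulary word
print(model.wv.vectors.shape[0])  # 3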
The vocabulary is in the vocab field of the Word2Vec model's wv property, as a dictionary, with the keys being each token (word). So it's just the usual Python for getting a dictionary's length:
len(w2v_model.wv.vocab)
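If you want to convince yourself that this agrees with the embedding-matrix approach above, a one-line check (a sketch, assuming a pre-4.0 gensim where wv.vocab still exists):

# dictionary entries and embedding rows count the same vocabulary
assert len(w2v_model.wv.vocab) == w2v_model.wv.vectors.shape[0]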
(In older gensim versions before 0.13, vocab appeared directly on the model, so you would use w2v_model.vocab instead of w2v_model.wv.vocab.)
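Note that in gensim 4.0 and later, the vocab dictionary was removed from KeyedVectors; key_to_index plays the same role there, and KeyedVectors also supports len() directly:

# gensim >= 4.0: `vocab` is gone, use `key_to_index` instead
len(w2v_model.wv.key_to_index)
# or equivalently, since KeyedVectors implements __len__:
len(w2v_model.wv)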