Question
I need to count the frequency of each word in word2vec's training model. I want output that looks like this:
term count
apple 123004
country 4432180
runs 620102
...
Is it possible to do that? How would I get that data out of word2vec?
Answer 1:
Which word2vec implementation are you using?
In the popular gensim library, after a Word2Vec model has its vocabulary established (either by completing its full training, or after build_vocab() has been called), the model's wv property contains a KeyedVectors-type object, which has a vocab property: a dict of Vocab-type objects, each of which has a count property holding the word's frequency in the scanned corpus.
So you could get roughly what you seek with something like:
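from gensim.models import Word2Vec

w2v_model = Word2Vec(your_corpus, ...)
# each vocab entry records how often the word appeared in the scanned corpus
for word in w2v_model.wv.vocab:
    print((word, w2v_model.wv.vocab[word].count))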
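The vocab dict described above exists in gensim releases before 4.0. In gensim 4.0 and later it was removed, and per-word counts are read with get_vecattr(word, "count") while the word list lives in index_to_key. A minimal sketch of a helper that handles either layout and prints the "term count" format from the question might look like the following. The function names word_counts and print_term_counts are hypothetical, chosen only for illustration, and the helper assumes it is given the model's wv object.

```python
def word_counts(wv):
    """Return (word, count) pairs, most frequent first, from a KeyedVectors-like object."""
    if hasattr(wv, "get_vecattr"):
        # gensim 4.x layout: counts are per-word attributes read via get_vecattr()
        pairs = [(word, wv.get_vecattr(word, "count")) for word in wv.index_to_key]
    else:
        # gensim 3.x layout: wv.vocab maps each word to an object with a .count
        pairs = [(word, entry.count) for word, entry in wv.vocab.items()]
    # sort by count, highest first
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def print_term_counts(wv):
    """Print a 'term count' header followed by one 'word count' line per word."""
    print("term count")
    for word, count in word_counts(wv):
        print(word, count)
```

With a trained model this would be called as print_term_counts(w2v_model.wv).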
Plain sets of word-vectors (such as those loaded via gensim's load_word2vec_format() method) won't have accurate counts, but by convention they are usually ordered internally from most frequent to least frequent.
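When only such a plain vector set is available, a word's position in that internal order can stand in as a rough frequency rank. The sketch below assumes the file follows the most-frequent-first convention, which is not guaranteed. The name frequency_rank is hypothetical; index_to_key is the ordered word list in gensim 4.x, and index2word is its name in earlier releases.

```python
def frequency_rank(wv, word):
    """Return the 0-based position of `word` in the vector set's internal order.

    A lower rank suggests a more frequent word, but only if the vectors were
    saved most-frequent-first. Raises ValueError if the word is not present.
    """
    # gensim 4.x exposes the ordered word list as index_to_key; 3.x used index2word
    ordered = wv.index_to_key if hasattr(wv, "index_to_key") else wv.index2word
    return list(ordered).index(word)
```

This gives an ordering rather than a count, so it can answer "which of these two words is more common" but not "how many times did this word occur".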
Source: https://stackoverflow.com/questions/55657062/how-can-i-count-word-frequencies-in-word2vecs-training-model