Gensim: Any chance to get word frequency in Word2Vec format?

Question

I am doing research with a fastText pre-trained model and need word frequencies for further analysis. Do the .vec or .bin files provided on the fastText website contain word-frequency information? If so, how do I get it?

I am loading the model with load_word2vec_format and tried model.wv.vocab[word].count, but that only gives the word's frequency rank, not its original frequency.

Answer 1:

I don't believe those formats include any word frequency information.

To the extent any pre-trained word-vectors declare what they were trained on – like, say, Wikipedia text – you could go back to the training corpus (or some reasonable approximation) to perform your own frequency-count. Even if you've only got a "similar" corpus, the frequencies might be "close enough" for your analytical need.
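As a rough illustration of such a recount, here is a minimal sketch using only the standard library. The names count_words and sample_corpus are hypothetical, and the regex tokenizer is an assumption: for counts to line up with a pre-trained vocabulary, the tokenization would need to resemble whatever preprocessing was applied to the original training text.

```python
import re
from collections import Counter

def count_words(lines):
    """Count lowercase word tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Crude tokenizer (an assumption): runs of word characters, lowercased.
        counts.update(re.findall(r"\w+", line.lower()))
    return counts

# Hypothetical stand-in for a real corpus, e.g. lines read from a text dump.
sample_corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

word_counts = count_words(sample_corpus)
print(word_counts.most_common(2))  # [('the', 4), ('cat', 2)]
```

For a large corpus, passing an open file object as lines keeps memory use proportional to the vocabulary size rather than to the corpus size.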

Similarly, you could potentially use the frequency rank to synthesize a dummy frequency table, using Zipf's Law, which roughly holds for normal natural-language corpora. Again, the relative proportions between words might be close enough to the real proportions for your need, even without the real/precise frequencies that were used during word-vector training.

Combining the version of the Zipf's law formula on the Wikipedia page that uses the harmonic number (H) in the denominator with the efficient approximation of H given in this answer, we can create a function that, given a word's rank (starting at 1) and the total number of unique words, gives the proportionate frequency predicted by Zipf's law:

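from numpy import euler_gamma
from scipy.special import digamma

def digamma_H(s):
    """Harmonic number H_s, computed as digamma(s + 1) + euler_gamma.

    If s is complex the result becomes complex.
    """
    return digamma(s + 1) + euler_gamma

def zipf_at(k_rank, N_total):
    """Proportionate frequency Zipf's law predicts for the word at rank
    k_rank (starting at 1) in a vocabulary of N_total unique words."""
    return 1.0 / (k_rank * digamma_H(N_total))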

Then, if you had a pretrained set of 1 million word-vectors, you could estimate the first word's frequency as:

>>> zipf_at(1, 1000000)
0.06947953777315177
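
To turn this into a full dummy frequency table, the same formula can be applied to every word in rank order. The sketch below is one possible way to do that, not a definitive implementation: zipf_frequency_table, ranked_words, and total_tokens are hypothetical names, the vocabulary is assumed to be ordered from most to least frequent (which is usually the case for word2vec-format files), and the total token count is a guess you would have to supply yourself. It also sums the harmonic series exactly in plain Python rather than using the digamma approximation, which is fine for vocabularies up to a few million words.

```python
import math

def harmonic_number(n):
    """Exact n-th harmonic number: 1 + 1/2 + ... + 1/n."""
    return math.fsum(1.0 / k for k in range(1, n + 1))

def zipf_frequency_table(words_by_rank, total_tokens):
    """Estimate a count for each word under Zipf's law (exponent 1).

    words_by_rank must be ordered most-frequent first; total_tokens is an
    assumed corpus size, not something stored in the vector file.
    """
    h = harmonic_number(len(words_by_rank))
    return {
        word: total_tokens / (rank * h)
        for rank, word in enumerate(words_by_rank, start=1)
    }

# Hypothetical four-word vocabulary in descending frequency order.
ranked_words = ["the", "of", "and", "to"]
estimated = zipf_frequency_table(ranked_words, total_tokens=1000)
for word, count in estimated.items():
    print(word, round(count, 1))
```

The estimated counts always sum to total_tokens, and the word at rank k gets 1/k of the top word's count, so only the relative proportions carry any meaning.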

Source: https://stackoverflow.com/questions/58735585/gensim-any-chance-to-get-word-frequency-in-word2vec-format

Tags

python-3.6

gensim

fasttext