问题
I am doing my research with fasttext pre-trained model and I need word frequency to do further analysis. Does the .vec or .bin files provided on fasttext website contain the info of word frequency? if yes, how do I get?
I am using load_word2vec_format to load the model tried using model.wv.vocab[word].count, which only gives you the word frequency rank not the original word frequency.
回答1:
I don't believe those formats include any word frequency information.
To the extent any pre-trained word-vectors declare what they were trained on – like, say, Wikipedia text – you could go back to the training corpus (or some reasonable approximation) to perform your own frequency-count. Even if you've only got a "similar" corpus, the frequencies might be "close enough" for your analytical need.
Similarly, you could potentially use the frequency-rank to synthesize a dummy frequency table, using Zipf's Law, which roughly holds for normal natural-language corpora. Again, the relative proportions between words might be roughly close enough to the real proportions for your need, even with real/precise frequencies as were used during word-vector training.
Synthesizing the version of the Zipf's law formula on the Wikipedia page that makes use of the Harmonic number (H) in the denominator, with the efficient approximation of H given in this answer, we can create a function that, given a word's (starting at 1) rank and the total number of unique words, gives the proportionate frequency predicted by Zipf's law:
from numpy import euler_gamma
from scipy.special import digamma
def digamma_H(s):
""" If s is complex the result becomes complex. """
return digamma(s + 1) + euler_gamma
def zipf_at(k_rank, N_total):
return 1.0 / (k_rank * digamma_H(N_total))
Then, if you had a pretrained set of 1 million word-vectors, you could estimate the first word's frequency as:
>>> zipf_at(1, 1000000)
0.06947953777315177
来源:https://stackoverflow.com/questions/58735585/gensim-any-chance-to-get-word-frequency-in-word2vec-format