How is TF-IDF implemented in gensim tool in python?

Question

From documents I found online, I worked out that the expression used to determine the term frequency and inverse document frequency (TF-IDF) weight of a term in a corpus is

tf-idf(wt)= tf * log(|N|/d);

I was going through the implementation of tf-idf mentioned in gensim. The example given in the documentation is

>>> doc_bow = [(0, 1), (1, 1)]
>>> print tfidf[doc_bow] # step 2 -- use the model to transform vectors
[(0, 0.70710678), (1, 0.70710678)]

This apparently does not follow the standard TF-IDF formula. What is the difference between the two models?

Note: 0.70710678 is 2^(-1/2), a value that commonly shows up in eigenvalue calculations. So how do eigenvalues come into the TF-IDF model?

Answer 1:

From Wikipedia:

The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document)

From the gensim source (lines 126-127), which scales the vector to unit length when normalization is enabled:

if self.normalize:
    vector = matutils.unitvec(vector)
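
To illustrate what unit-length scaling does to the vector from the question, here is a minimal pure-Python sketch. The helper `unit_vector` is a hypothetical stand-in written for this example, not gensim's own `matutils.unitvec`, and it assumes plain Euclidean (L2) length.

```python
import math

def unit_vector(bow):
    """Scale a sparse (term_id, weight) vector to unit Euclidean length.

    Hypothetical helper for illustration only; not gensim's implementation.
    """
    norm = math.sqrt(sum(weight * weight for _, weight in bow))
    if norm == 0:
        # An all-zero vector cannot be scaled; return it unchanged.
        return list(bow)
    return [(term_id, weight / norm) for term_id, weight in bow]

# Two terms with equal weight: each component becomes 1/sqrt(2), about 0.7071.
normalized = unit_vector([(0, 1.0), (1, 1.0)])
print(normalized)
```

Because both weights are equal, each ends up as 1/sqrt(2) regardless of their common magnitude, which is where 0.70710678 comes from; no eigenvalue computation is involved.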

Answer 2:

The bag of words (doc_bow) contains two tokens, t0 and t1. We don't know whether t0 and t1 appear in one document or in two, nor whether the model (tfidf) was built over documents containing these tokens at all. The bag doc_bow is just a query (test data), while the model was built from training data that may or may not contain t0 or t1.

So let's make an assumption: the tfidf model was built over two documents, d0 and d1, where d0 contains t0 and d1 contains t1. The total number of documents (N) is then 2, and the term frequency and document frequency of both t0 and t1 are 1.

By default, gensim uses log base 2 to calculate IDF (see the df2idf function), so the transformed tfidf vector for doc_bow would be [(0, 1), (1, 1)] (e.g. tfidf(t0) = 1 * log_2(2/1) = 1).

We also need to account for the L2 normalization performed by default, so the final output becomes [(0, 1 / 2^(1/2)), (1, 1 / 2^(1/2))], i.e. approximately [(0, 0.70710678), (1, 0.70710678)].
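
The calculation above can be reproduced with a short pure-Python sketch. It does not call gensim; `tfidf_transform`, `doc_freqs`, and `num_docs` are hypothetical names chosen for this example, and the corpus statistics (N = 2, document frequency 1 for each term) are the assumptions made in this answer, not facts about the original model.

```python
import math

def tfidf_transform(doc_bow, doc_freqs, num_docs):
    """Weight a bag of words by tf * log2(N / df), then L2-normalize.

    Illustrative sketch only; terms unseen in the assumed corpus are skipped.
    """
    weighted = [
        (term_id, tf * math.log2(num_docs / doc_freqs[term_id]))
        for term_id, tf in doc_bow
        if doc_freqs.get(term_id, 0) > 0
    ]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    if norm == 0:
        # All weights are zero (or no known terms); nothing to normalize.
        return weighted
    return [(term_id, weight / norm) for term_id, weight in weighted]

# Assumed corpus: 2 documents, t0 appears only in d0, t1 only in d1.
doc_freqs = {0: 1, 1: 1}
result = tfidf_transform([(0, 1), (1, 1)], doc_freqs, num_docs=2)
print(result)  # both weights are about 0.7071
```

Before normalization each weight is 1 * log2(2/1) = 1; dividing by the vector length sqrt(2) gives 1/sqrt(2) for both terms. Any other assumption that leaves the two terms with equal nonzero weights would give the same normalized result.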

Source: https://stackoverflow.com/questions/9470479/how-is-tf-idf-implemented-in-gensim-tool-in-python

Tags

python

tf-idf

latent-semantic-indexing

gensim