What is the difference between TfidfVectorizer and TfidfTransformer?

跟風遠走 submitted on 2020-01-16 19:12:24


I know that the formula for the TF-IDF vectorizer is

(count of word / total count) * log(number of documents / number of documents where the word is present)

I saw that there is also a TfidfTransformer in scikit-learn, and I wanted to know the difference between them. I couldn't find anything helpful.


TfidfVectorizer is used on raw text documents, while TfidfTransformer is used on an existing count matrix, such as one returned by CountVectorizer.
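To illustrate the difference in inputs, here is a minimal sketch. The count matrix and the two sample sentences are made up for illustration; only the scikit-learn and SciPy classes are real.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

# Hypothetical term counts: 2 documents (rows) x 3 vocabulary terms (columns).
counts = csr_matrix(np.array([[3, 0, 1],
                              [2, 1, 0]]))

# TfidfTransformer never sees any text; it only reweights existing counts.
weights = TfidfTransformer().fit_transform(counts)

# TfidfVectorizer starts from raw strings and tokenizes them itself.
docs = ["the sky is blue", "the sun is bright"]
tfidf = TfidfVectorizer().fit_transform(docs)

print(weights.shape)  # same shape as the count matrix
print(tfidf.shape)    # (number of documents, size of learned vocabulary)
```

So the transformer is the right tool when you already have counts (for example, from your own tokenizer), and the vectorizer is the shortcut when you start from text.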


Artem's answer pretty much sums up the difference. To make things clearer, here is an example as referenced from here.

TfidfTransformer can be used as follows:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

train_set = ["The sky is blue.", "The sun is bright."]

# Step 1: build the term-count matrix from the raw text
vectorizer = CountVectorizer(stop_words='english')
trainVectorizerArray = vectorizer.fit_transform(train_set)
print("Fit Vectorizer to train set")
print(trainVectorizerArray.todense())

# Step 2: reweight the counts into TF-IDF values
transformer = TfidfTransformer()
res = transformer.fit_transform(trainVectorizerArray)
print(res.todense())

## RESULT:  

Fit Vectorizer to train set
[[1 0 1 0]
 [0 1 0 1]]

[[0.70710678 0.         0.70710678 0.        ]
 [0.         0.70710678 0.         0.70710678]]

Extraction of count features, TF-IDF weighting, and row-wise Euclidean normalization can be done in one operation with TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
res1 = tfidf.fit_transform(train_set)
print(res1.todense())

## RESULT:  

[[0.70710678 0.         0.70710678 0.        ]
 [0.         0.70710678 0.         0.70710678]]

Both processes produce a sparse matrix containing the same values.
Other useful references are the documentation for TfidfTransformer.fit_transform, CountVectorizer.fit_transform and TfidfVectorizer.
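If you want to check the equivalence yourself rather than compare the printed matrices by eye, something like the following should work. It is a small sketch that reuses the same two sentences; the variable names `two_step` and `one_step` are mine.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

train_set = ["The sky is blue.", "The sun is bright."]

# Route 1: CountVectorizer followed by TfidfTransformer
count_vec = CountVectorizer(stop_words='english')
two_step = TfidfTransformer().fit_transform(count_vec.fit_transform(train_set))

# Route 2: TfidfVectorizer on the raw text
tfidf_vec = TfidfVectorizer(stop_words='english')
one_step = tfidf_vec.fit_transform(train_set)

# Both routes should learn the same vocabulary and the same weights.
same_vocab = count_vec.vocabulary_ == tfidf_vec.vocabulary_
same_values = np.allclose(two_step.toarray(), one_step.toarray())
print(same_vocab, same_values)
```

Note that this only holds when both routes are given the same options (here `stop_words='english'` and otherwise default settings).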


With TfidfTransformer, you first compute word counts using CountVectorizer, then compute the IDF values, and only then compute the TF-IDF scores. With TfidfVectorizer, you do all three steps at once.
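To make the three steps concrete, here is a rough sketch that does them by hand with NumPy and compares the result to TfidfVectorizer. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which, as far as I know, the term frequency is the raw count and the IDF is ln((1 + n) / (1 + df)) + 1. That is slightly different from the textbook formula in the question. The sample sentences are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The sky is blue.", "The sun is bright and the sky is clear."]

# Step 1: raw term counts (rows = documents, columns = vocabulary terms)
counts = CountVectorizer().fit_transform(docs).toarray()

# Step 2: smoothed IDF, assuming the default smooth_idf=True
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: TF-IDF scores, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# The one-step route for comparison
auto = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, auto))
```

If you change any of those defaults on the vectorizer, the manual steps would need to change to match.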

I think you should read this article which sums it up with an example.

