what is the difference between tfidf vectorizer and tfidf transformer

跟風遠走 提交于 2020-01-16 19:12:24

问题


I know that the formula for tfidf vectorizer is

Count of word/Total count * log(Number of documents / no.of documents where word is present)

I saw there's tfidf transformer in the scikit learn and I just wanted to difference between them. I could't find anything that's helpful.


回答1:


TfidfVectorizer is used on sentences, while TfidfTransformer is used on an existing count matrix, such as one returned by CountVectorizer




回答2:


Artem's answer pretty much sums up the difference. To make things clearer here is an example as referenced from here.

TfidfTransformer can be used as follows:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


train_set = ["The sky is blue.", "The sun is bright."] 

vectorizer = CountVectorizer(stop_words='english')
trainVectorizerArray =   vectorizer.fit_transform(article_master['stemmed_content'])

transformer = TfidfTransformer()
res = transformer.fit_transform(trainVectorizerArray)

print ((res.todense()))


## RESULT:  

Fit Vectorizer to train set
[[1 0 1 0]
 [0 1 0 1]]

[[0.70710678 0.         0.70710678 0.        ]
 [0.         0.70710678 0.         0.70710678]]

Extraction of count features, TF-IDF normalization and row-wise euclidean normalization can be done in one operation with TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
res1 = tfidf.fit_transform(train_set)
print ((res1.todense()))


## RESULT:  

[[0.70710678 0.         0.70710678 0.        ]
 [0.         0.70710678 0.         0.70710678]]

Both processes produce a sparse matrix comprising of the same values.
Other useful references would be tfidfTransformer.fit_transform, countVectoriser_fit_transform and tfidfVectoriser .




回答3:


With Tfidftransformer you will compute word counts using CountVectorizer and then compute the IDF values and only then compute the Tf-idf scores. With Tfidfvectorizer you will do all three steps at once.

I think you should read this article which sums it up with an example.



来源:https://stackoverflow.com/questions/54745482/what-is-the-difference-between-tfidf-vectorizer-and-tfidf-transformer

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!