
Question:

This page, http://scikit-learn.org/stable/modules/feature_extraction.html, mentions:

TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.

I followed the code there and called fit_transform() on my corpus. How do I get the weight of each feature computed by fit_transform()?

I tried:

    In [39]: vectorizer.idf_
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
     in ()
    ----> 1 vectorizer.idf_

    AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'

but this attribute is missing.

Thanks

Answer 1:

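Since version 0.15, the IDF (inverse document frequency) weight of each feature can be retrieved via the idf_ attribute of the fitted TfidfVectorizer object. Note that this is the per-feature IDF weight, not the per-document tf-idf score: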

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["This is very strange",
              "This is very nice"]

    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(corpus)  # sparse matrix of tf-idf scores

    # idf_ holds one IDF weight per feature, in the same order as the feature names
    idf = vectorizer.idf_
    print(dict(zip(vectorizer.get_feature_names(), idf)))

Output:

    {u'is': 1.0,
     u'nice': 1.4054651081081644,
     u'strange': 1.4054651081081644,
     u'this': 1.0,
     u'very': 1.0}

As discussed in the comments, prior to version 0.15, a workaround is to access the attribute idf_ via the supposedly hidden _tfidf (an instance of TfidfTransformer) of the vectorizer:

    # _tfidf is the TfidfTransformer held internally by the vectorizer
    idf = vectorizer._tfidf.idf_
    print(dict(zip(vectorizer.get_feature_names(), idf)))

which should give the same output as above.
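
The IDF values in the output above can be reproduced by hand. The sketch below is not scikit-learn's code; it assumes the vectorizer's default smooth_idf=True setting, under which the weight is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The names doc_freq and smoothed_idf are made up for this illustration.

```python
import math

# Document frequencies for the two-sentence corpus:
# "This is very strange" and "This is very nice".
n_docs = 2
doc_freq = {"is": 2, "nice": 1, "strange": 1, "this": 2, "very": 2}


def smoothed_idf(df, n):
    """IDF with add-one smoothing (assumes smooth_idf=True)."""
    return math.log((1 + n) / (1 + df)) + 1


idf = {term: smoothed_idf(df, n_docs) for term, df in doc_freq.items()}

for term in sorted(idf):
    print("%s %s" % (term, idf[term]))
```

Terms that occur in both documents get a weight of 1.0, while "nice" and "strange", which occur in only one, get ln(3/2) + 1, about 1.405.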

Answer 2:

Here is how to get the TF-IDF values of the terms in each document, using the matrix X returned by fit_transform():

    # tf is the fitted TfidfVectorizer, X the matrix returned by fit_transform()
    feature_names = tf.get_feature_names()
    doc = 0  # index of the document (row of X) to inspect
    feature_index = X[doc, :].nonzero()[1]  # column indices of the nonzero entries
    tfidf_scores = zip(feature_index, [X[doc, x] for x in feature_index])
    for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
        print("%s %s" % (w, s))

Output for doc = 0:

    this 0.448320873199
    is 0.448320873199
    very 0.448320873199
    strange 0.630099344518

and for doc = 1:

    this 0.448320873199
    is 0.448320873199
    very 0.448320873199
    nice 0.630099344518

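I think the results are normalized per document: the squared scores of each document sum to 1, which is what L2 normalization (the default norm='l2') would produce:

    >>> 0.448320873199**2 + 0.448320873199**2 + 0.448320873199**2 + 0.630099344518**2
    0.9999999999997548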
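
To make the normalization concrete, here is a minimal sketch that rebuilds the scores of the first document ("This is very strange") by hand. It assumes the default settings smooth_idf=True and norm='l2', and the variable names (term_counts, raw, norm) are chosen for illustration only.

```python
import math

# IDF weights for the terms of document 0, assuming smooth_idf=True:
# terms present in both documents get 1.0, "strange" gets ln(3/2) + 1.
idf = {"this": 1.0, "is": 1.0, "very": 1.0, "strange": math.log(3 / 2) + 1}

# Each term occurs exactly once in "This is very strange".
term_counts = {"this": 1, "is": 1, "very": 1, "strange": 1}

# Unnormalized tf-idf: term count times IDF weight.
raw = {term: term_counts[term] * idf[term] for term in term_counts}

# L2 normalization: divide every weight by the Euclidean length of the row.
norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
tfidf = {term: weight / norm for term, weight in raw.items()}

for term, score in tfidf.items():
    print("%s %s" % (term, score))

# The squared scores of a normalized row sum to 1, up to rounding.
print(sum(score ** 2 for score in tfidf.values()))
```

This gives roughly 0.4483 for "this", "is" and "very" and 0.6301 for "strange", matching the scores printed for doc = 0.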

Source: tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer
