Why is the value in the vectorized corpus different from the value obtained through the idf_ attribute? Shouldn't the idf_ attribute just return the inverse document frequency (IDF), the same value that appears in the vectorized corpus?
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
"This is very nice"]
vectorizer = TfidfVectorizer()
corpus = vectorizer.fit_transform(corpus)
print(corpus)
Vectorized corpus:
(0, 2) 0.6300993445179441
(0, 4) 0.44832087319911734
(0, 0) 0.44832087319911734
(0, 3) 0.44832087319911734
(1, 1) 0.6300993445179441
(1, 4) 0.44832087319911734
(1, 0) 0.44832087319911734
(1, 3) 0.44832087319911734
Vocabulary and idf_ values:
print(dict(zip(vectorizer.vocabulary_, vectorizer.idf_)))
Output:
{'this': 1.0,
'is': 1.4054651081081644,
'very': 1.4054651081081644,
'strange': 1.0,
'nice': 1.0}
Vocabulary index:
print(vectorizer.vocabulary_)
Output:
{'this': 3,
'is': 0,
'very': 4,
'strange': 2,
'nice': 1}
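For reference, you can also view the tf-idf matrix densely with one column per term, which makes it easier to line the values up with the vocabulary indices above; a minimal sketch (pandas and get_feature_names_out(), the newer name for get_feature_names(), are not part of the original snippet):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This is very strange", "This is very nice"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# one column per vocabulary term, in feature-index (column) order
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
# the 'this' column (index 3) shows 0.448321 for both documents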
Why is the IDF value of the word this 0.44 in the vectorized corpus but 1.0 when obtained through idf_?
This is because of the l2 normalization that TfidfVectorizer() applies by default. Every term in this corpus occurs exactly once per document, so each raw tf-idf value is just that term's idf; the l2 normalization then rescales each row to unit length, which is why the values in the vectorized corpus are smaller. If you set the norm param to None, you will get the same values as idf_:
>>> vectorizer = TfidfVectorizer(norm=None)
>>> print(vectorizer.fit_transform(corpus))  # corpus is the original list of strings
(0, 2) 1.4054651081081644
(0, 4) 1.0
(0, 0) 1.0
(0, 3) 1.0
(1, 1) 1.4054651081081644
(1, 4) 1.0
(1, 0) 1.0
(1, 3) 1.0
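To connect the two outputs, you can apply the l2 normalization by hand to the unnormalized row and recover the numbers from the first output; a minimal sketch using the values printed above:
import numpy as np

# unnormalized tf-idf values of document 0, in column order: is (0), strange (2), this (3), very (4)
row = np.array([1.0, 1.4054651081081644, 1.0, 1.0])

# l2 normalization divides each entry by the Euclidean (l2) norm of the row
print(row / np.linalg.norm(row))
# [0.44832087 0.63009934 0.44832087 0.44832087] -- the values from the default output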
Also, your way of computing each feature's corresponding idf value is wrong: iterating over vectorizer.vocabulary_ yields the terms in the order they were first seen, not in the column order that idf_ uses, so the zip pairs terms with the wrong values. Use get_feature_names() (or get_feature_names_out() in newer scikit-learn versions), which returns the terms in column order:
>>> print(dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)))
{'is': 1.0,
'nice': 1.4054651081081644,
'strange': 1.4054651081081644,
'this': 1.0,
'very': 1.0}
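For completeness, the idf_ values themselves come from scikit-learn's smoothed idf formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t (the default smooth_idf=True behaviour). A quick check against the values above:
import numpy as np

n_docs = 2
doc_freq = {"is": 2, "nice": 1, "strange": 1, "this": 2, "very": 2}

for term, df in doc_freq.items():
    idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
    print(term, idf)
# is/this/very -> 1.0, nice/strange -> 1.4054651081081644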
Source: https://stackoverflow.com/questions/56653159/why-is-the-value-of-tf-idf-different-from-idf