Why is the value of TF-IDF different from IDF_?

Submitted by 半世苍凉 on 2019-12-02 02:45:07

Question


Why do the values in the vectorized corpus differ from the values obtained through the idf_ attribute? Shouldn't the idf_ attribute just return the inverse document frequency (IDF) exactly as it appears in the vectorized corpus?

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer()
corpus = vectorizer.fit_transform(corpus)

print(corpus)

Vectorized corpus:

  (0, 2)    0.6300993445179441
  (0, 4)    0.44832087319911734
  (0, 0)    0.44832087319911734
  (0, 3)    0.44832087319911734
  (1, 1)    0.6300993445179441
  (1, 4)    0.44832087319911734
  (1, 0)    0.44832087319911734
  (1, 3)    0.44832087319911734

Vocabulary and idf_ values:

print(dict(zip(vectorizer.vocabulary_, vectorizer.idf_)))

Output:

{'this': 1.0, 
 'is': 1.4054651081081644, 
 'very': 1.4054651081081644, 
 'strange': 1.0, 
 'nice': 1.0}

Vocabulary index:

print(vectorizer.vocabulary_)

Output:

{'this': 3, 
 'is': 0, 
 'very': 4, 
 'strange': 2, 
 'nice': 1}

Why is the value for the word "this" 0.44 in the vectorized corpus but 1.0 when obtained through idf_?


Answer 1:


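This is because of L2 normalization, which TfidfVectorizer() applies by default (norm='l2'): each row of tf-idf values is divided by its Euclidean length. If you set the norm parameter to None, you get the same values as idf_. In this corpus every term occurs once per document, so the term frequency is 1 and the unnormalized tf-idf value equals the idf value.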


>>> vectorizer = TfidfVectorizer(norm=None)
>>> print(vectorizer.fit_transform(["This is very strange", "This is very nice"]))

Output:

  (0, 2)    1.4054651081081644
  (0, 4)    1.0
  (0, 0)    1.0
  (0, 3)    1.0
  (1, 1)    1.4054651081081644
  (1, 4)    1.0
  (1, 0)    1.0
  (1, 3)    1.0
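
To see where the 0.448 and 0.630 in the default output come from, the calculation can be reproduced by hand. The following is a minimal sketch in plain Python, assuming scikit-learn's default settings (smooth_idf=True, sublinear_tf=False, norm='l2'). The helper names smoothed_idf and l2_normalize are made up for illustration and are not part of scikit-learn.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF as documented for smooth_idf=True:
    # ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalize(values):
    # Divide each value by the Euclidean length of the row
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

# The corpus from the question, lowercased and split into tokens by hand
docs = [["this", "is", "very", "strange"],
        ["this", "is", "very", "nice"]]

# Document frequency: number of documents that contain each term
terms = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in terms}
idf = {t: smoothed_idf(len(docs), df[t]) for t in terms}

# Unnormalized tf-idf for the first document (tf is 1 for every term here)
row0 = {t: docs[0].count(t) * idf[t] for t in docs[0]}

# L2-normalize the row, as the default norm='l2' does
normalized = dict(zip(row0, l2_normalize(list(row0.values()))))

print(idf["this"], idf["strange"])
print(normalized["this"], normalized["strange"])
```

The unnormalized row for the first document is three values of 1.0 plus one value of about 1.405, and its Euclidean length is about 2.23. Dividing by that length gives roughly 0.448 for "this" and 0.630 for "strange", which matches the default output in the question.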

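Also, your way of computing each feature's idf value is wrong. vocabulary_ is a dict that maps each term to its column index, and iterating over it does not yield the terms in index order, whereas idf_ is ordered by column index. Zipping the two therefore pairs terms with the wrong values.

Use the feature names, which are in index order (get_feature_names() was removed in scikit-learn 1.2; on current versions call get_feature_names_out() instead):

 >>> print(dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)))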

     {'is': 1.0,
      'nice': 1.4054651081081644, 
      'strange': 1.4054651081081644, 
      'this': 1.0, 
      'very': 1.0}
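
Another way to avoid depending on iteration order is to look up each term's idf by its index in vocabulary_. The sketch below uses the literal values printed in the question rather than a fitted vectorizer, so vocabulary and idf are stand-ins for vectorizer.vocabulary_ and vectorizer.idf_.

```python
# Stand-ins for vectorizer.vocabulary_ and vectorizer.idf_, copied from the
# outputs shown in the question
vocabulary = {"this": 3, "is": 0, "very": 4, "strange": 2, "nice": 1}
idf = [1.0, 1.4054651081081644, 1.4054651081081644, 1.0, 1.0]

# vocabulary maps term -> column index and idf is ordered by column index,
# so index into idf instead of zipping the two sequences together
idf_by_term = {term: idf[index] for term, index in sorted(vocabulary.items())}

print(idf_by_term)
```

This yields the same pairing as the feature-name approach: "nice" and "strange", which each appear in only one document, get the higher idf.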


Source: https://stackoverflow.com/questions/56653159/why-is-the-value-of-tf-idf-different-from-idf
