TFIDF calculating confusion

Posted by 本小妞迷上赌 on 2019-12-04 14:14:30

Question


I found the following code on the internet for calculating TFIDF:

https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1+" to the denominator in the function idf(word, documentList) so that I won't get a division-by-zero error:

return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))

But I am confused about two things:

  1. I get negative values in some cases. Is this correct?
  2. I am confused by lines 62, 63 and 64.

Code:

    documentNumber = 0
    for word in documentList[documentNumber].split(None):
        words[word] = tfidf(word, documentList[documentNumber], documentList)

Should TFIDF be calculated on the first document only?


Answer 1:


  1. No. Tf-idf is tf, a non-negative value, times idf, a non-negative value, so it can never be negative. This code seems to implement the erroneous definition of tf-idf that was on Wikipedia for years (it has since been fixed).



Answer 2:


If the word in question is contained in every document in the collection, your 1+ change will result in a negative value. That is because 0 < x / (1 + x) < 1 holds for all x > 0, and the logarithm of a number between 0 and 1 is negative.
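
To see this in numbers, here is a minimal sketch. The helper num_docs_containing is a hypothetical stand-in for numDocsContaining from the linked script (its real implementation may differ), and the three sample documents are made up for illustration. Only the idf formula itself is taken from the question.

```python
import math

def num_docs_containing(word, document_list):
    # Hypothetical helper: counts documents that contain `word`
    # as a whitespace-separated token.
    return sum(1 for document in document_list if word in document.split())

def idf_plus_one_denominator(word, document_list):
    # Formula from the question: 1 is added to the denominator only.
    return math.log(len(document_list) /
                    (1 + float(num_docs_containing(word, document_list))))

# Made-up sample collection: "the" occurs in every document, "cat" in one.
docs = ["the cat sat", "the dog ran", "the bird flew"]

common = idf_plus_one_denominator("the", docs)  # log(3 / 4), below zero
rare = idf_plus_one_denominator("cat", docs)    # log(3 / 2), above zero
print(common, rare)
```

With three documents, a word found in all of them gives log(3 / 4), which is where the negative values come from.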

In my opinion the correct IDF for a nonexistent word is infinite or undefined. However, if you add 1 to both the denominator and the numerator, a nonexistent word will have an IDF slightly higher than that of any existing word, and words that exist in every document will have an IDF of zero. Both cases will probably work well with your code.
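
A minimal sketch of that variant, under the same assumptions: num_docs_containing is a hypothetical stand-in for numDocsContaining, idf_smoothed is a name invented here, and the sample documents are made up.

```python
import math

def num_docs_containing(word, document_list):
    # Hypothetical helper: counts documents that contain `word`
    # as a whitespace-separated token.
    return sum(1 for document in document_list if word in document.split())

def idf_smoothed(word, document_list):
    # 1 is added to both the numerator and the denominator, so the ratio
    # is never below 1 and the logarithm is never negative.
    total = len(document_list)
    return math.log((1 + total) / (1 + num_docs_containing(word, document_list)))

# Made-up sample collection.
docs = ["the cat sat", "the dog ran", "the bird flew"]

everywhere = idf_smoothed("the", docs)  # log(4 / 4), exactly zero
present = idf_smoothed("cat", docs)     # log(4 / 2)
missing = idf_smoothed("zebra", docs)   # log(4 / 1), the largest value
print(everywhere, present, missing)
```

The ordering missing > present > everywhere matches the behaviour described above: no division by zero, no negative values.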



Source: https://stackoverflow.com/questions/16648599/tfidf-calculating-confusion
