Effects of Stemming on the term frequency?

若如初见. Submitted on 2019-11-29 08:54:27

Question


How are term frequency (TF) and inverse document frequency (IDF) affected by stop-word removal and stemming?

Thanks!


Answer 1:


TF is term frequency: the number of times a term occurs in a document. IDF is inverse document frequency, which is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
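
As a minimal sketch of these two definitions, assuming raw counts for TF and the natural logarithm for IDF (the function names and the toy documents are made up for illustration, and real libraries often add smoothing):

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Raw count of each term in a single tokenized document."""
    return Counter(tokens)

def inverse_document_frequency(term, documents):
    """log(N / df): N is the number of documents, df the number containing the term."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / df)

# Hypothetical toy corpus, already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

tf_doc0 = term_frequencies(docs[0])
idf_the = inverse_document_frequency("the", docs)  # term in 3 of 3 documents
idf_cat = inverse_document_frequency("cat", docs)  # term in 1 of 3 documents

print(tf_doc0["cat"])     # 1
print(idf_the)            # 0.0
print(idf_cat > idf_the)  # True
```

A term that occurs in every document gets an IDF of 0, while a term that occurs in only one document gets the largest IDF in the corpus.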

Stemming groups all words that are derived from the same stem (e.g. 'played', 'play', ...). This grouping increases the number of occurrences of the stem, because frequencies are calculated on stems rather than on the original words. For example, suppose you have 2 documents: the first contains 'play' 2 times and 'played' 5 times, and the second contains 'play' 3 times and 'played' 1 time. If you search for 'play' without stemming, the second document is ranked first because it has more occurrences of the word 'play' (3 versus 2). With stemming, both words become 'play', and the first document is ranked first because it contains the stem 'play' 7 times, while the second contains it 4 times.
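
The two-document example can be reproduced with a short sketch. The stemmer here is a deliberately naive, hypothetical one that only strips a trailing 'ed'; a real system would use something like a Porter stemmer, and ranking by raw count is a simplifying assumption.

```python
from collections import Counter

def toy_stem(word):
    """Hypothetical, naive stemmer: strips a trailing 'ed' from longer words."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    return word

def count_term(query, tokens, stem=False):
    """Count occurrences of the query, optionally stemming query and tokens first."""
    if stem:
        query = toy_stem(query)
        tokens = [toy_stem(t) for t in tokens]
    return Counter(tokens)[query]

def rank(query, docs, stem=False):
    """Return (document names ordered by descending count, the counts themselves)."""
    counts = {name: count_term(query, tokens, stem) for name, tokens in docs.items()}
    order = sorted(counts, key=counts.get, reverse=True)
    return order, counts

# The two documents from the example above.
docs = {
    "doc1": ["play"] * 2 + ["played"] * 5,
    "doc2": ["play"] * 3 + ["played"] * 1,
}

order_raw, counts_raw = rank("play", docs)
order_stem, counts_stem = rank("play", docs, stem=True)

print(order_raw, counts_raw)    # ['doc2', 'doc1'] {'doc1': 2, 'doc2': 3}
print(order_stem, counts_stem)  # ['doc1', 'doc2'] {'doc1': 7, 'doc2': 4}
```

Stemming also affects IDF: a document that contains any variant now counts as containing the stem, so the stem's document frequency is at least as high as that of each variant, and its IDF is never higher.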

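Concerning stop-word removal: a stop word occurs frequently in all documents and isn't considered a keyword for any of them, so it has a high frequency without carrying any meaning. Because it appears in nearly every document, its IDF is close to zero, and removing it simply drops those uninformative counts.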
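
A rough illustration of the stop-word case, assuming a tiny hand-written stop list and toy documents (both hypothetical) and the same log(N / df) definition of IDF as above:

```python
import math
from collections import Counter

# Tiny hypothetical stop list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "of"}

# Hypothetical toy corpus, tokenized on whitespace.
docs = [
    "the cat is a friend of the dog".split(),
    "the stemmer is a part of the pipeline".split(),
    "the index is a map of the terms".split(),
]

def idf(term, documents):
    """log(N / df) for a term that occurs in at least one document."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df)

tf_before = Counter(docs[0])
tf_after = Counter(w for w in docs[0] if w not in STOP_WORDS)

idf_the = idf("the", docs)  # occurs in every document
idf_cat = idf("cat", docs)  # occurs in one document

print(tf_before.most_common(1))  # [('the', 2)]
print(sorted(tf_after))          # ['cat', 'dog', 'friend']
print(idf_the)                   # 0.0
```

Before removal, the most frequent term in the first document is the stop word 'the', yet its IDF is 0. After removal, only the content words remain in the counts.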



Source: https://stackoverflow.com/questions/10464265/effects-of-stemming-on-the-term-frequency
