tfidf.transform() function not returning correct values

Submitted by 浪尽此生 on 2019-12-02 03:20:49

Question


I am trying to fit a TfidfVectorizer on a text corpus and then use the same vectorizer to compute the sum of the tf-idf values of a new text. However, the sum is not what I expect. Here is an example:

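import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["I am new to python and R , how can anyone help me","why is no one able to crack the python code without help"]
tf = TfidfVectorizer(stop_words='english', ngram_range=(1, 1))
tf.fit_transform(text)
# newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
list(zip(tf.get_feature_names(), tf.idf_))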

[(u'able', 1.4054651081081644),
 (u'code', 1.4054651081081644),
 (u'crack', 1.4054651081081644),
 (u'help', 1.0),
 (u'new', 1.4054651081081644),
 (u'python', 1.0)]

Now, when I apply the same tf to a new text:

new_text = "i am not able to code"
np.sum(tf.transform([new_text]))
1.4142135623730951

I expected the output to be around 2.80. Any suggestions on what might be going wrong here would be really helpful.


Answer 1:


This happens because TfidfVectorizer applies L2 normalization by default (norm='l2'). Before normalization, the result of transform() is what you expect:

array([[ 1.40546511,  1.40546511,  0.        ,  0.        ,  0.        ,
     0.        ]])

The normalization step then divides this vector by its L2 norm, the divisor:

divisor = sqrt(sqr(1.40546511)+sqr(1.40546511)+sqr(0)+sqr(0)+sqr(0)+sqr(0))
        = sqrt(1.975332175+1.975332175+0+0+0+0)
        = 1.98762782

So the resulting final array is:

array([[ 0.70710678,  0.70710678,  0.        ,  0.        ,  0.        ,
     0.        ]])

Summing this array gives 1.4142135623730951.
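
The arithmetic above can be checked by hand with a few lines of plain Python. This is only a sketch of the L2 normalization step, not scikit-learn's actual implementation; the variable names (raw, divisor, normalized) are made up for illustration, and the input values are the rounded ones quoted in this answer.

```python
import math

# Unnormalized tf-idf row for "i am not able to code", as quoted above.
raw = [1.40546511, 1.40546511, 0.0, 0.0, 0.0, 0.0]

# L2 norm: square root of the sum of the squared entries.
divisor = math.sqrt(sum(v * v for v in raw))

# Divide every entry by the norm to get a unit-length row.
normalized = [v / divisor for v in raw]

print(round(divisor, 6))          # 1.987628
print(round(sum(raw), 4))         # 2.8109
print(round(sum(normalized), 6))  # 1.414214
```

The sum of the raw row is the roughly 2.8 value the question expected; the sum of the normalized row is the value transform() actually produced.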

I hope that makes it clear. You can refer to my answer here for the complete workings of TfidfVectorizer.
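
If the unnormalized sum is what you actually want, one option is to turn the normalization off by passing norm=None to TfidfVectorizer. The sketch below assumes a reasonably recent scikit-learn and reuses the corpus from the question; the names tf_raw and tf_l2 are just illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["I am new to python and R , how can anyone help me",
        "why is no one able to crack the python code without help"]
new_text = "i am not able to code"

# norm=None disables the default per-row L2 normalization.
tf_raw = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), norm=None)
tf_raw.fit(text)
raw_sum = tf_raw.transform([new_text]).sum()

# Default settings (norm='l2'), for comparison.
tf_l2 = TfidfVectorizer(stop_words='english', ngram_range=(1, 1))
tf_l2.fit(text)
l2_sum = tf_l2.transform([new_text]).sum()

print(raw_sum)  # roughly 2.81
print(l2_sum)   # roughly 1.41
```

Whether dropping the normalization is appropriate depends on the use case: without it, longer texts tend to get larger sums simply because they contain more terms.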



Source: https://stackoverflow.com/questions/43091235/tfidf-transform-function-not-returning-correct-values
