Question
I am trying to fit a TfidfVectorizer on a text corpus and then use the same vectorizer to compute the sum of the tf-idf values for new text. However, the sum is not what I expect. Here is an example:
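import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
text = ["I am new to python and R , how can anyone help me","why is no one able to crack the python code without help"]
tf = TfidfVectorizer(stop_words='english', ngram_range=(1,1))
tf.fit_transform(text)
# newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
zip(tf.get_feature_names(), tf.idf_)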
[(u'able', 1.4054651081081644),
(u'code', 1.4054651081081644),
(u'crack', 1.4054651081081644),
(u'help', 1.0),
(u'new', 1.4054651081081644),
(u'python', 1.0)]
Now when I apply the same tf to new text:
new_text = "i am not able to code"
np.sum(tf.transform([new_text]))
1.4142135623730951
I am expecting the output to be around 2.80. Any suggestion on what might be going wrong here would be really helpful.
Answer 1:
This is because of L2 normalization (norm='l2' is the default in TfidfVectorizer).
As you expect, the intermediate result of transform(), before normalization, is:
array([[ 1.40546511, 1.40546511, 0. , 0. , 0. ,
0. ]])
But then normalization is applied: the vector above is divided by its L2 norm (the divisor):
divisor = sqrt(sqr(1.40546511)+sqr(1.40546511)+sqr(0)+sqr(0)+sqr(0)+sqr(0))
= sqrt(1.975332175+1.975332175+0+0+0+0)
= 1.98762782
So the resulting final array is:
array([[ 0.70710678, 0.70710678, 0. , 0. , 0. ,
0. ]])
When you then apply sum, the result is 1.4142135623730951.
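The arithmetic above can be checked by hand. The following is a minimal sketch using only the standard library; the variable names (raw, l2_norm, normalized) are illustrative, and the raw weights are copied from the idf values shown in the question.

```python
import math

# Unnormalized tf-idf weights for "able" and "code" (values taken from the question);
# the remaining four vocabulary terms do not occur in the new text.
raw = [1.4054651081081644, 1.4054651081081644, 0.0, 0.0, 0.0, 0.0]

# L2 norm: square root of the sum of squared components
l2_norm = math.sqrt(sum(v * v for v in raw))

# Divide every component by the norm to get a unit-length vector
normalized = [v / l2_norm for v in raw]

print(round(l2_norm, 4))          # 1.9876
print(round(sum(normalized), 8))  # 1.41421356
print(round(sum(raw), 8))         # 2.81093022
```

The last line shows that the sum of the unnormalized weights is the value of roughly 2.8 that the question expected.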
Hope it is clear now. You can refer to my answer here for a complete walkthrough of how TfidfVectorizer works.
Source: https://stackoverflow.com/questions/43091235/tfidf-transform-function-not-returning-correct-values
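If the unnormalized sum is what you actually want, one option is to turn normalization off by passing norm=None to TfidfVectorizer. The sketch below reuses the corpus from the question and compares both settings; the names tf_raw, tf_l2, raw_sum and l2_sum are only illustrative, and whether unnormalized weights are appropriate depends on what the sums are used for.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["I am new to python and R , how can anyone help me",
        "why is no one able to crack the python code without help"]
new_text = "i am not able to code"

# norm=None disables the default L2 row normalization
tf_raw = TfidfVectorizer(stop_words='english', ngram_range=(1, 1), norm=None)
tf_raw.fit(text)
raw_sum = tf_raw.transform([new_text]).sum()

# Default settings (norm='l2'), shown for comparison
tf_l2 = TfidfVectorizer(stop_words='english', ngram_range=(1, 1))
tf_l2.fit(text)
l2_sum = tf_l2.transform([new_text]).sum()

print(raw_sum)  # about 2.81
print(l2_sum)   # about 1.41
```

With normalization on, every non-empty row has unit Euclidean length, so row sums mostly reflect how many distinct terms matched rather than how large their idf weights are.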