I have searched a lot but still cannot understand this. I understand that by default TfidfVectorizer will apply l2 normalization on term frequency. This article explains the equation for it. I am using TfidfVectorizer on text written in the Gujarati language. Here are the details of the output:
My two documents are:
ખુબ વખાણ કરે છે
ખુબ વધારે છે
The code I am using is:
vectorizer = TfidfVectorizer(tokenizer=tokenize_words, sublinear_tf=True, use_idf=True, smooth_idf=False)
Here, tokenize_words is my function for tokenizing words.
The TF-IDF values for my data are:
[[ 0.6088451 0.35959372 0.35959372 0.6088451 0. ]
[ 0. 0.45329466 0.45329466 0. 0.76749457]]
The list of features:
['કરે', 'ખુબ', 'છે.', 'વખાણ', 'વધારે']
The value of idf:
{'વખાણ': 1.6931471805599454, 'છે.': 1.0, 'કરે': 1.6931471805599454, 'વધારે': 1.6931471805599454, 'ખુબ': 1.0}
Please explain, using this example, what the term frequency of each term is in both of my documents.
OK, now let's go through the documentation I linked in the comments, step by step:
Documents:
ખુબ વખાણ કરે છે
ખુબ વધારે છે
Get all unique terms (features): ['કરે', 'ખુબ', 'છે.', 'વખાણ', 'વધારે']
Calculate the frequency of each term in the documents:
a. Each term in document1 [ખુબ વખાણ કરે છે] is present once, and વધારે is not present.
b. So the term frequency vector (ordered according to the features) is:
[1 1 1 1 0]
c. Applying steps a and b on document2, we get
[0 1 1 0 1]
d. So our final term-frequency vector is
[[1 1 1 1 0], [0 1 1 0 1]]
Note: This is the term frequency you want
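The counting steps above can be sketched in plain Python. This is only an illustration, not what TfidfVectorizer runs internally: it assumes a simple whitespace split in place of the asker's tokenize_words function, and it uses the documents exactly as written above (so the feature is 'છે' without the trailing period shown in the feature list).

```python
from collections import Counter

# The two documents from the question.
docs = ["ખુબ વખાણ કરે છે", "ખુબ વધારે છે"]

# Assumption: str.split() stands in for the tokenize_words function.
tokenized = [doc.split() for doc in docs]

# Features are the sorted unique terms across all documents.
features = sorted({term for tokens in tokenized for term in tokens})

# One row per document, one column per feature, holding raw counts.
# Counter returns 0 for terms that do not occur in a document.
tf = [[Counter(tokens)[term] for term in features] for tokens in tokenized]

print(features)
print(tf)
```

Each row of tf lines up with the sorted feature list, which is why the column order matters in every later step.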
Now find the IDF (this is computed per feature, not per document):
idf(term) = log(number of documents / number of documents with this term) + 1
Here log is the natural logarithm. The 1 is added to the idf value so that terms that occur in every document (where the log is 0) are not ignored entirely. The "smooth_idf" parameter, which is True by default, additionally adds 1 to both the numerator and the denominator to prevent zero divisions. You set smooth_idf=False, so the formula above applies as written.
idf('કરે') = log(2/1) + 1 = 0.69314.. + 1 = 1.69314..
idf('ખુબ') = log(2/2) + 1 = 0 + 1 = 1
idf('છે.') = log(2/2) + 1 = 0 + 1 = 1
idf('વખાણ') = log(2/1) + 1 = 0.69314.. + 1 = 1.69314..
idf('વધારે') = log(2/1) + 1 = 0.69314.. + 1 = 1.69314..
Note: This corresponds to the idf values you showed in the question.
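As a quick check of the idf arithmetic, here is a small sketch using only the standard library. The doc_freq dictionary and the idf_unsmoothed helper are illustrative names, not part of scikit-learn, and the document frequencies are typed in by hand from the two documents.

```python
import math

n_docs = 2

# Document frequency: in how many of the two documents each term occurs.
doc_freq = {"કરે": 1, "ખુબ": 2, "છે": 2, "વખાણ": 1, "વધારે": 1}

def idf_unsmoothed(df, n):
    """idf = ln(n / df) + 1, the formula used when smooth_idf=False."""
    return math.log(n / df) + 1

idf = {term: idf_unsmoothed(df, n_docs) for term, df in doc_freq.items()}

for term, value in idf.items():
    print(term, value)
```

Terms that appear in both documents get an idf of exactly 1, and terms that appear in only one get 1 + ln(2).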
Now calculate the TF-IDF (this is again calculated document-wise, following the order of the features):
a. For document1:
For 'કરે', tf-idf = tf(કરે) x idf(કરે) = 1 x 1.69314 = 1.69314
For 'ખુબ', tf-idf = tf(ખુબ) x idf(ખુબ) = 1 x 1 = 1
For 'છે.', tf-idf = tf(છે.) x idf(છે.) = 1 x 1 = 1
For 'વખાણ', tf-idf = tf(વખાણ) x idf(વખાણ) = 1 x 1.69314 = 1.69314
For 'વધારે', tf-idf = tf(વધારે) x idf(વધારે) = 0 x 1.69314 = 0
So for document1, the tf-idf vector before normalization is
[1.69314 1 1 1.69314 0]
b. Now normalization is done (l2, Euclidean):
divisor = sqrt(sqr(1.69314)+sqr(1)+sqr(1)+sqr(1.69314)+sqr(0)) = sqrt(2.8667230596 + 1 + 1 + 2.8667230596 + 0) = sqrt(7.7334461192) = 2.7809074272977876...
Dividing each element of the tf-idf array by the divisor, we get:
[0.6088445 0.3595948 0.3595948 0.6088445 0]
Note: This is the tf-idf of the first document you posted in the question (the last digits differ slightly because idf was rounded to 1.69314 here).
c. Now repeat steps a and b for document2, and we get:
[ 0. 0.453294 0.453294 0. 0.767494]
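The multiplication and l2 normalization can be reproduced with a few lines of plain Python. This is a sketch of the arithmetic described above rather than scikit-learn's own code; the tf and idf lists are copied from the earlier steps, and l2_normalize is an illustrative helper name.

```python
import math

# Term-frequency rows, in feature order: કરે, ખુબ, છે, વખાણ, વધારે.
tf = [[1, 1, 1, 1, 0],
      [0, 1, 1, 0, 1]]

# Unsmoothed idf per feature: ln(2 / df) + 1.
rare = math.log(2 / 1) + 1    # term occurs in one document
common = math.log(2 / 2) + 1  # term occurs in both documents
idf = [rare, common, common, rare, rare]

def l2_normalize(row):
    """Divide each element by the Euclidean length of the row."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else list(row)

# Multiply tf by idf element-wise, then normalize each document row.
tfidf = [l2_normalize([t * w for t, w in zip(row, idf)]) for row in tf]

for row in tfidf:
    print([round(v, 8) for v in row])
```

Because this uses the full-precision idf instead of the rounded 1.69314, the printed rows should agree with the matrix in the question to the digits shown there.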
Update: About sublinear_tf = True or False
Your original term frequency vector is [[1 1 1 1 0], [0 1 1 0 1]], and you are correct in your understanding that using sublinear_tf = True will change the term frequency vector:
new_tf = 1 + log(tf)
This transformation is applied only to the non-zero elements of the term-frequency vector, because log(0) is undefined.
All your non-zero entries are 1, and log(1) is 0, so 1 + log(1) = 1 + 0 = 1.
So the values remain unchanged for elements with value 1, and new_tf = [[1 1 1 1 0], [0 1 1 0 1]] = tf(original).
In other words, sublinear_tf is applied to your term frequencies, but the result is identical to the original.
Hence all the subsequent calculations are the same, and the output is the same whether you use sublinear_tf=True or sublinear_tf=False.
If you change your documents so that the term-frequency vector contains elements other than 1 and 0, you will see a difference when using sublinear_tf.
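To see when sublinear scaling actually changes something, here is a minimal sketch of the 1 + log(tf) rule applied only to non-zero counts. The sublinear helper and the second example row are made up for illustration; they are not taken from the question's data.

```python
import math

def sublinear(tf_row):
    """Replace each non-zero count tf with 1 + ln(tf); leave zeros alone."""
    return [1 + math.log(tf) if tf > 0 else 0 for tf in tf_row]

# Row from the question: every non-zero count is 1, so nothing changes.
unchanged = sublinear([1, 1, 1, 1, 0])

# Hypothetical row where one term occurs three times: that entry shrinks.
changed = sublinear([3, 1, 0])

print(unchanged)
print(changed)
```

Only counts greater than 1 are affected, which is why the question's two documents give the same output with either setting.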
Hope your doubts are cleared now.
Source: https://stackoverflow.com/questions/42440621/how-term-frequency-is-calculated-in-tfidfvectorizer