I'm having trouble understanding how the tf-idf values are obtained in the following program:
I have tried calculating the value of a in document 2 (
TfidfVectorizer() adds smoothing to the document counts and applies l2 normalization to the resulting tf-idf vector, as mentioned in the documentation.
tf-idf(char, doc) = (count of the character in the document) / (number of characters in the document) * (log((1 + # docs) / (1 + # docs in which the character is present)) + 1)
This normalization is l2 by default, but you can change or remove this step by using the parameter norm. Similarly, smoothing can be
To understand how the exact score is computed, I am going to fit a CountVectorizer() to get the count of each character in every document.
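import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# corpus is the list of documents from the question
countVectorizer = CountVectorizer(analyzer='char')
tf = countVectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_df = pd.DataFrame(tf.toarray(),
                     columns=countVectorizer.get_feature_names_out())
tf_df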
#output:
   .  ?  _  a  c  d  e  f  h  i  m  n  o  r  s  t  u
0  1  0  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
1  1  0  5  0  3  3  4  0  2  2  2  3  3  0  3  4  2
2  1  0  5  1  0  2  2  0  3  3  0  2  1  1  2  3  0
3  0  1  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
Let us now apply the tf-idf weighting, following the sklearn implementation, to document 2 (the row with index 2 in the count table):
import numpy as np
from sklearn.preprocessing import normalize

v = []
doc_id = 2
# number of documents in the corpus + smoothing
n_d = 1 + tf_df.shape[0]

for char in tf_df.columns:
    # calculate tf - count of this char in the doc / total number of chars in the doc
    tf = tf_df.loc[doc_id, char] / tf_df.loc[doc_id, :].sum()
    # number of documents containing this char, with smoothing
    df_d_t = 1 + sum(tf_df.loc[:, char] > 0)
    # now calculate the idf with smoothing
    idf = np.log(n_d / df_d_t) + 1
    # calculate the score now
    v.append(tf * idf)

# normalize the vector with l2 norm and create a dataframe with feature names
# (tf_df.columns holds the same feature names as the vectorizer)
pd.DataFrame(normalize([v], norm='l2'), columns=tf_df.columns)
#output:
. ? _ a c d e f h i m n o r s t u
0.140615 0.0 0.57481 0.220301 0.0 0.229924 0.229924 0.0 0.344886 0.344886 0.0 0.229924 0.114962 0.140615 0.229924 0.344886 0.0
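A side note on the tf term: scikit-learn uses the raw count as tf, not count / document length. Dividing by the length, as in the loop above, is harmless because l2 normalization removes any constant factor applied to the whole vector. Here is a minimal sketch in plain Python that illustrates this for document 2. The counts and document frequencies are read off the count table; the variable names and the l2_normalize helper are mine, for illustration only.

```python
import math

# Character counts for document 2 and per-character document frequencies,
# both read off the CountVectorizer table (column order as in that table).
chars = ['.', '?', '_', 'a', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']
counts = [1, 0, 5, 1, 0, 2, 2, 0, 3, 3, 0, 2, 1, 1, 2, 3, 0]
doc_freq = [3, 1, 4, 1, 3, 4, 4, 2, 4, 4, 3, 4, 4, 3, 4, 4, 3]
n_docs = 4

# Smoothed idf: log((1 + n_docs) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(vec):
    """Scale vec to unit Euclidean length (illustrative helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

total = sum(counts)
# tf as raw count vs. tf as count / document length
raw = l2_normalize([c * w for c, w in zip(counts, idf)])
scaled = l2_normalize([(c / total) * w for c, w in zip(counts, idf)])

a = chars.index('a')
print(round(raw[a], 6), round(scaled[a], 6))  # the two values agree
```

Both variants give the same normalized score for a, so the division by document length only matters if you switch normalization off.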
You can see that the score for the char a matches the TfidfVectorizer() output!
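To check this directly, you can compare against TfidfVectorizer() itself. The sketch below is only an illustration: the corpus is not reproduced here, so the one in the code is a hypothetical corpus that I reconstructed so that its character counts match the count table (including the _ column); the corpus in the question may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ASSUMPTION: hypothetical corpus, reconstructed so that its character counts
# match the count table above; the corpus in the question may differ.
corpus = [
    'This_is_the_first_document.',
    'This_document_is_the_second_document.',
    'And_this_is_the_third_one.',
    'Is_this_the_first_document?',
]

# Defaults apply here: lowercase=True, norm='l2', smooth_idf=True
vectorizer = TfidfVectorizer(analyzer='char')
tfidf = vectorizer.fit_transform(corpus)

# Look up the column of 'a' by name rather than by position
col_a = vectorizer.vocabulary_['a']
score_a = tfidf[2, col_a]
print(round(score_a, 6))  # should be close to the manual result, 0.220301
```

Looking the column up through vocabulary_ avoids relying on the column order of the output.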