How is the Tf-Idf value calculated with analyzer ='char'?

隐瞒了意图╮ 2020-12-11 20:48

I'm having trouble understanding how the Tf-Idf values are obtained in the following program:

I have tried calculating the value of a in the document 2 (

1 Answer
  • 2020-12-11 20:56

    TfidfVectorizer() adds smoothing to the document counts and applies l2 normalization to the resulting tf-idf vector, as mentioned in the documentation.

    (count of occurrences of the character) / (number of characters in the given document) *
    ( log( (1 + # Docs) / (1 + # Docs in which the given character is present) ) + 1 )

    This normalization is l2 by default, but you can change or remove this step with the norm parameter. Similarly, smoothing can be turned off with the smooth_idf parameter.
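
    To make the smoothing concrete, here is a small sketch of the idf term with and without add-one smoothing, for a corpus of 4 documents and a character that appears in 1, 3 or all 4 of them. The helper names smoothed_idf and unsmoothed_idf are made up for this illustration and are not part of sklearn; the unsmoothed form assumes the smooth_idf=False definition given in the sklearn documentation.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """idf with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def unsmoothed_idf(n_docs, doc_freq):
    """idf without smoothing: ln(n / df) + 1 (df must be > 0)."""
    return math.log(n_docs / doc_freq) + 1


n_docs = 4  # illustrative corpus size
for doc_freq in (1, 3, 4):
    print(doc_freq,
          round(smoothed_idf(n_docs, doc_freq), 6),
          round(unsmoothed_idf(n_docs, doc_freq), 6))
```

    A character that occurs in every document gets an idf of exactly 1 in both variants, so its weight is just its term frequency before normalization.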

    To see exactly how the score is computed, I am going to fit a CountVectorizer() to get the count of each character in every document.

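    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # corpus is the list of documents from the question
    countVectorizer = CountVectorizer(analyzer='char')
    tf = countVectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    tf_df = pd.DataFrame(tf.toarray(),
                         columns=countVectorizer.get_feature_names_out())
    tf_df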
    
    #output:
       .  ?  _  a  c  d  e  f  h  i  m  n  o  r  s  t  u
    0  1  0  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
    1  1  0  5  0  3  3  4  0  2  2  2  3  3  0  3  4  2
    2  1  0  5  1  0  2  2  0  3  3  0  2  1  1  2  3  0
    3  0  1  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
    

    Now let us apply the tf-idf weighting, following the sklearn implementation, to document 2 (doc_id = 2, i.e. the third row of the table above, since indexing starts at 0).

    import numpy as np
    from sklearn.preprocessing import normalize

    v = []
    doc_id = 2
    # number of documents in the corpus + smoothing
    n_d = 1 + tf_df.shape[0]

    for char in tf_df.columns:
        # tf: count of this char in the doc / total number of chars in the doc
        tf_value = tf_df.loc[doc_id, char] / tf_df.loc[doc_id, :].sum()

        # number of documents containing this char, with smoothing
        df_d_t = 1 + sum(tf_df.loc[:, char] > 0)
        # idf with smoothing
        idf = np.log(n_d / df_d_t) + 1

        # tf-idf score for this char
        v.append(tf_value * idf)

    # l2-normalize the vector and build a DataFrame labelled with the feature names
    # (tf_df.columns holds the same feature names as the fitted vectorizer)
    pd.DataFrame(normalize([v], norm='l2'), columns=tf_df.columns)
    
    #output:
    
           .    ?        _         a    c         d         e    f         h         i    m         n         o         r         s         t    u
    0.140615  0.0  0.57481  0.220301  0.0  0.229924  0.229924  0.0  0.344886  0.344886  0.0  0.229924  0.114962  0.140615  0.229924  0.344886  0.0
    

    You can see that the score for the character a (0.220301) matches the TfidfVectorizer() output.
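
    The same check can be run for every document at once. The sketch below is a vectorized version of the manual computation, compared against TfidfVectorizer(analyzer='char') directly. The corpus is an assumption: the original one is not shown in this thread, so a stand-in is used that happens to reproduce the count table above (four short sentences with underscores between the words). The sketch also uses raw counts as tf instead of dividing by the document length; that division only rescales each row, so it cancels out under l2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# ASSUMPTION: stand-in corpus, not taken from the original question.
# It was chosen because its character counts match the table in the answer.
corpus = [
    "This_is_the_first_document.",
    "This_document_is_the_second_document.",
    "And_this_is_the_third_one.",
    "Is_this_the_first_document?",
]

# Raw character counts, one row per document.
count_vectorizer = CountVectorizer(analyzer="char")
counts = count_vectorizer.fit_transform(corpus).toarray()

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then l2-normalize each row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Reference values from TfidfVectorizer with its default settings.
reference = TfidfVectorizer(analyzer="char").fit_transform(corpus).toarray()

a_col = count_vectorizer.vocabulary_["a"]
print(np.allclose(manual, reference))
print(round(manual[2, a_col], 6))
```

    If the two matrices agree, the manual formula is reproducing what TfidfVectorizer does with its default norm and smooth_idf settings.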
