Interpreting the sum of TF-IDF scores of words across documents


One corpus-level interpretation of TF-IDF for a given term is the highest TF-IDF score that term reaches in any document of the corpus.

Find the top words in corpus_tfidf:

    # corpus_tfidf: gensim-style corpus, one list of (word_id, tf_idf) pairs per document
    # dictionary:   gensim Dictionary mapping word ids back to words
    topWords = {}
    for doc in corpus_tfidf:
        for iWord, tf_idf in doc:
            # keep the highest TF-IDF the word reaches in any document
            if tf_idf > topWords.get(iWord, 0):
                topWords[iWord] = tf_idf

    # print the six words with the highest per-document TF-IDF
    for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
        print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
        if i == 6:
            break

Output comparison chart:
NOTE: Couldn't use gensim to create a matching dictionary for corpus_tfidf, so only word indices can be displayed.

Question's tfidf_saliency  topWords(corpus_tfidf)  Other TF-IDF implementation  
---------------------------------------------------------------------------  
1: Word(7)   0.121        1: Word(13)    0.640    1: paths         0.376019  
2: Word(8)   0.111        2: Word(27)    0.632    2: intersection  0.376019  
3: Word(26)  0.108        3: Word(28)    0.632    3: survey        0.366204  
4: Word(29)  0.100        4: Word(8)     0.628    4: minors        0.366204  
5: Word(9)   0.090        5: Word(29)    0.628    5: binary        0.300815  
6: Word(14)  0.087        6: Word(11)    0.544    6: generation    0.300815  

The calculation of TF-IDF always takes the corpus into account.

Tested with Python 3.4.2.

There are two contexts in which saliency can be calculated:

  1. saliency in the corpus
  2. saliency in a single document

Saliency in the corpus can simply be calculated by counting the appearances of a particular word in the corpus, or by the inverse of the count of documents in which the word appears (IDF = Inverse Document Frequency). This works because words that carry specific meaning do not appear everywhere.
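For instance, here is a minimal sketch (the toy corpus is made up for this answer) of corpus-level saliency computed from document frequency alone:

    from collections import Counter
    from math import log

    corpus = [["human", "interface", "computer"],
              ["survey", "user", "computer", "system"],
              ["the", "user", "the", "system"]]

    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))  # docs containing each word
    idf = {w: log(n_docs / df[w]) for w in df}           # rarer across docs -> higher IDF
    print(sorted(idf.items(), key=lambda kv: -kv[1])[:3])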

Saliency within a document is calculated by tf-idf, because it combines two kinds of information: global information (corpus-based) and local information (document-based). Claiming that "the word with the larger in-document frequency is more important in the current document" is neither completely true nor false, because it depends on the global saliency of the word. In a particular document you have many words like "it", "is", "am", "are", ... with large frequencies, but these words are not important in any document and you can treat them as stop words!
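As a toy illustration of that point (made-up documents; assumes scikit-learn 1.0+ for get_feature_names_out), a ubiquitous word like "the" ends up with a lower tf-idf than a distinctive word like "quantum":

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "the physics of quantum fields"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs).toarray()
    scores = dict(zip(vec.get_feature_names_out(), X[2]))
    # "the" appears in every document, so its IDF (and hence tf-idf) is low;
    # "quantum" appears only in the third document, so its tf-idf is high.
    print(scores["the"], scores["quantum"])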

---- edit ---

The denominator (= len(corpus_tfidf)) is a constant value and could be ignored if you want to deal with ordinality rather than cardinality of measurement. On the other hand, we know that IDF means Inverse Document Frequency, so IDF can be represented by 1/DF. We know that DF is a corpus-level value and TF is a document-level value. The TF-IDF summation turns document-level TF into corpus-level TF. Indeed, the summation is equal to this formula:

count(word) / count(documents containing word)

This measurement could be called an inverse-scattering value. When the value goes up, it means the word is gathered into a smaller subset of documents, and vice versa.
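A minimal numeric check of that equality, assuming tf is the raw in-document count and idf is taken literally as 1/DF (no log, no normalization), which is what the formula above requires:

    corpus = [["a", "b", "a"], ["a", "c"], ["b", "b"]]

    word = "a"
    df = sum(1 for doc in corpus if word in doc)          # 2 documents contain "a"
    tfidf_sum = sum(doc.count(word) * (1.0 / df) for doc in corpus)
    print(tfidf_sum)                                      # 1.5
    # equals count(word) / count(documents containing word) = 3 / 2
    print(sum(doc.count(word) for doc in corpus) / df)    # 1.5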

I believe that this formula is not so useful.

This is a great discussion. Thanks for starting this thread. The idea of including document length by @avip seems interesting; I will have to experiment and check the results. In the meantime, let me try asking the question a little differently: what are we trying to interpret when querying for TF-IDF relevance scores?

  1. Possibly trying to understand the word relevance at the document level
  2. Possibly trying to understand the word relevance per Class
  3. Possibly trying to understand the word relevance overall ( in the whole corpus )

     # corpus of 6 documents over 3 features (raw term counts)
     import numpy as np
     from sklearn.feature_extraction.text import TfidfTransformer

     counts = [[3, 0, 1],
               [2, 0, 0],
               [3, 0, 0],
               [4, 0, 0],
               [3, 2, 0],
               [3, 0, 2]]

     transformer = TfidfTransformer(smooth_idf=False)
     tfidf = transformer.fit_transform(counts)
     print(tfidf.toarray())

     # lambdas for basic aggregate statistics per feature (column)
     summarizer_default = lambda x: np.sum(x, axis=0)
     summarizer_mean = lambda x: np.mean(x, axis=0)

     print(summarizer_default(tfidf))
     print(summarizer_mean(tfidf))
    

Result:

# Result post computing TF-IDF relevance scores
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

# Result post aggregation (Sum, Mean) 
[[ 4.87420595  0.88089948  1.38675962]]
[[ 0.81236766  0.14681658  0.2311266 ]]

If we observe closely, we realize that feature 1, which occurred in all the documents, is not ignored completely, because the sklearn implementation of idf = log[ n / df(d, t) ] + 1. The +1 is added so that an important word which just so happens to occur in all documents is not ignored, e.g. the word 'bike' occurring very frequently when classifying a particular document as 'motorcycle' (20_newsgroup dataset).
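To make that concrete, here is a sketch that reproduces the matrix above by hand, assuming sklearn's smooth_idf=False formula idf = ln(n/df) + 1 followed by the default L2 row normalization:

    import numpy as np

    counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 0, 0],
                       [4, 0, 0],
                       [3, 2, 0],
                       [3, 0, 2]], dtype=float)

    n = counts.shape[0]                    # 6 documents
    df = (counts > 0).sum(axis=0)          # document frequency per feature
    idf = np.log(n / df) + 1               # smooth_idf=False: idf = ln(n/df) + 1
    tfidf = counts * idf                   # raw tf-idf
    tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2 row norm (sklearn default)
    print(tfidf.round(8))                  # matches transformer.fit_transform(counts)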

  1. Now in regards to the first two questions, one is trying to interpret and understand the top common features that might be occurring in the documents. In that case, aggregating in some form, including all possible occurrences of the word in a doc, is not taking anything away, even mathematically. IMO such a query is very useful for exploring the dataset and helping understand what the dataset is about. The logic might be applied to vectorizing using hashing as well.

    relevance_score = mean(tf(t,d) * idf(t,d))
                    = mean( (bias + initial_wt * F(t,d) / max{F(t',d)}) * (log(N / df(d,t)) + 1) )

  2. Question 3 is very important, as it might as well be contributing to the features being selected for building a predictive model. Using TF-IDF scores independently for feature selection might be misleading at multiple levels. Adopting a more theoretical statistical test such as chi2, coupled with TF-IDF relevance scores, might be a better approach (see the sketch after this list). Such a statistical test also evaluates the importance of the feature in relation to the respective target class.
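A hedged sketch of that coupling (the documents and labels here are made up purely for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = ["ride my bike to work",
            "motorcycle engines are loud",
            "bake bread with yeast",
            "knead the dough well"]
    labels = [0, 0, 1, 1]                             # two target classes

    X = TfidfVectorizer().fit_transform(docs)         # non-negative, as chi2 requires
    selector = SelectKBest(chi2, k=4).fit(X, labels)  # chi2 scores features against the class
    print(selector.scores_)                           # higher score -> more class-discriminative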

And of course, combining such interpretation with the model's learned feature weights would be very helpful in understanding the importance of text-derived features completely.

The problem is a little more complex to cover in detail here, but hoping the above helps. What do others feel?

Reference: https://arxiv.org/abs/1707.05261

I stumbled across the same problem somehow. I will share my solution here but don't really know how effective it is.

After calculating tf-idf, what we basically have is a matrix of terms vs. documents:

[terms/docs : doc1           doc2           ...  docn
 term1      : tf(doc1)*idf   tf(doc2)*idf   ...  tf(docn)*idf
 .
 .
 .
 termn      : tf(doc1)*idf   tf(doc2)*idf   ...  tf(docn)*idf]

We can think of the columns doc1, doc2, ..., docn as scores given to every term according to n different metrics. If we sum across the columns, we are simply averaging the scores, which is naive and does not completely represent the information captured. We can do something better, as this is a top-k retrieval problem. One efficient algorithm is Fagin's algorithm, which works on this idea:

The sorted lists are scanned until k data items are found which have been seen in all the lists, then the algorithm can stop and it is guaranteed that among all the data items seen so far, even those which were not present in all the lists, the top-k data items can be found.

Sorted lists here simply means that a single column for a particular doc becomes a list, and we have n such lists. So sort each one of them in descending order and then run Fagin's algorithm on them.
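Here is a rough sketch of that idea (the function fagins_topk and the data layout are my own for illustration, not from any library): scan the sorted lists in parallel until k terms have been seen in every list, then score all seen terms by random access.

    import heapq

    def fagins_topk(score_lists, k, agg=sum):
        # score_lists: one dict {term: score} per document (column).
        m = len(score_lists)
        # Sorted access: each column ordered by descending score.
        sorted_lists = [sorted(d.items(), key=lambda kv: -kv[1]) for d in score_lists]
        seen = {}          # term -> set of columns it has been seen in
        depth = 0
        max_len = max(len(lst) for lst in sorted_lists)
        # Phase 1: scan all lists in parallel until k terms appear in every list.
        while sum(len(cols) == m for cols in seen.values()) < k and depth < max_len:
            for i, lst in enumerate(sorted_lists):
                if depth < len(lst):
                    term = lst[depth][0]
                    seen.setdefault(term, set()).add(i)
            depth += 1
        # Phase 2: random access to aggregate every seen term, then take the top k.
        totals = {t: agg(d.get(t, 0.0) for d in score_lists) for t in seen}
        return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

    # e.g. two documents scoring three terms:
    print(fagins_topk([{"a": 0.9, "b": 0.2, "c": 0.5},
                       {"a": 0.1, "b": 0.8, "c": 0.6}], k=2))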

You can read more about it here.

If we want to find the "saliency" or "importance" of the words within this corpus, can we simply do the sum of the tf-idf scores across all documents and divide it by the number of documents? If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?

If you summed tf-idf scores across documents, terms that would otherwise have low scores might get a boost, and terms with higher scores might have their scores depressed.

I don't think simply dividing by the total number of documents will be sufficient normalization to address this. Maybe incorporate document length into the normalization factor? Either way, I think all such methods would still need to be adjusted per domain.

So, generally speaking, mathematically I expect you would get an undesirable averaging effect.
