Scikit Learn TfidfVectorizer: How to get top n terms with highest tf-idf score

2020-12-07 18:46

I am working on a keyword extraction problem. Consider the very general case:

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

t = """Two ...
2 Answers
  • 2020-12-07 19:36

    You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:

    import numpy as np

    # `response` here is the tf-idf matrix for the document being scored,
    # e.g. response = tfidf.transform([some_doc])
    # (on scikit-learn >= 1.0, use tfidf.get_feature_names_out() instead)
    feature_array = np.array(tfidf.get_feature_names())
    tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

    n = 3
    top_n = feature_array[tfidf_sorting][:n]
    

    This gives me:

    array([u'fruit', u'travellers', u'jupiter'],
          dtype='<U13')
    

    The argsort call is really the useful one here (see the numpy docs for np.argsort). We have to do [::-1] because argsort only sorts from small to large. We call flatten to reduce the dimensions to 1-d so that the sorted indices can be used to index the 1-d feature array. Note that the call to flatten only works if you're scoring one document at a time.
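
    For intuition, here is a tiny self-contained illustration of that descending-sort trick (the scores array is made up purely for the example):

    import numpy as np

    # made-up scores, just to show the argsort + [::-1] pattern
    scores = np.array([0.1, 0.7, 0.3])
    order = np.argsort(scores)[::-1]  # ascending indices, reversed

    print(order)          # [1 2 0]
    print(scores[order])  # [0.7 0.3 0.1]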

    Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise each term in the multiline string is treated as its own "document". Splitting on \n\n instead means we are actually looking at four documents (one per paragraph), which makes more sense when you think about tf-idf.
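
    As a minimal sketch of that suggestion (the toy paragraphs below stand in for the question's truncated string):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # toy stand-in for the question's multi-paragraph string
    t = "jupiter shines bright\n\ntravellers rest in the shade\n\nthe tree bears no fruit"

    tfidf = TfidfVectorizer(stop_words='english')
    tfs = tfidf.fit_transform(t.split("\n\n"))  # one "document" per paragraph
    print(tfs.shape)  # (3, n_features): one tf-idf row per paragraph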

  • 2020-12-07 19:46

    A solution using the sparse matrix itself (without calling .toarray())!

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer(stop_words='english')
    corpus = [
        'I would like to check this document',
        'How about one more document',
        'Aim is to capture the key words from the corpus',
        'frequency of words in a document is called term frequency'
    ]

    X = tfidf.fit_transform(corpus)
    # on scikit-learn >= 1.0, use tfidf.get_feature_names_out() instead
    feature_names = np.array(tfidf.get_feature_names())

    new_doc = ['can key words in this new document be identified?',
               'idf is the inverse document frequency calculated for each of the words']
    responses = tfidf.transform(new_doc)


    def get_top_tf_idf_words(response, top_n=2):
        # response is a single sparse CSR row: .data holds its nonzero
        # tf-idf scores and .indices their vocabulary column positions
        sorted_nzs = np.argsort(response.data)[:-(top_n + 1):-1]
        return feature_names[response.indices[sorted_nzs]]

    print([get_top_tf_idf_words(response, 2) for response in responses])

    # [array(['key', 'words'], dtype='<U9'),
    #  array(['frequency', 'words'], dtype='<U9')]
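
    The slice [:-(top_n+1):-1] is just a compact way of taking the top_n largest entries: argsort sorts ascending, and stepping backwards from the end yields the indices of the largest scores first, all without ever calling .toarray().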
    