Get the document name in scikit-learn tf-idf matrix

深忆病人 2021-01-03 08:25

I have created a tf-idf matrix but now I want to retrieve top 2 words for each document. I want to pass document id and it should give me the top 2 words.

Right now,

1 answer
  • 2021-01-03 08:59

    By doing

    t = test_v.fit_transform(d.values())
    

    you lose any link to the document ids. The rows of the matrix simply follow the order in which the values were passed, and a dict is not guaranteed to be ordered (before Python 3.7), so you cannot tell which row belongs to which id. This means that before passing the values to fit_transform, you need to record which value corresponds to which id.

    For example, you can do this:

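    values = []  # documents, in the order they are passed to the vectorizer
    key = {}     # document id -> row index in the tf-idf matrix
    
    for counter, (k, v) in enumerate(d.items()):
        values.append(v)
        key[k] = counter
    
    # Row i of t now corresponds to values[i]
    t = test_v.fit_transform(values)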
    

    From there you can build a function to access this matrix by document id:

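    def get_doc_row(docid):
        # Look up the row index recorded for this document id
        rowid = key[docid]
        # Slice out that document's (sparse) tf-idf row
        row = t[rowid,:]
        return row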
    