What is the simplest way to get tfidf with pandas dataframe?

ⅰ亾dé卋堺 submitted on 2019-11-27 11:41:47

Question


I want to calculate tf-idf for the documents below. I'm using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

First, I thought I would need to get word_count for each row. So I wrote a simple function:

def word_count(sent):
    # Map each whitespace-separated word to the number of times it occurs
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt:
            word2cnt[word] += 1
        else:
            word2cnt[word] = 1
    return word2cnt

And then, I applied it to each row.

df['word_count'] = df['sent'].apply(word_count)

But now I'm lost. I know there's an easy way to calculate tf-idf with GraphLab, but I want to stick with an open-source option. Both scikit-learn and gensim look overwhelming. What's the simplest way to get tf-idf?


Answer 1:


The scikit-learn implementation is really easy:

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

There are plenty of parameters you can specify; see the scikit-learn documentation for TfidfVectorizer.
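To make the defaults less of a black box, here is a rough pure-Python sketch of what TfidfVectorizer computes when no parameters are passed: lowercased tokens of two or more word characters, raw counts as tf, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The helper names (tokenize, tfidf_row) are made up for illustration, and the regex is a simplification of the library's default token pattern, so treat this as an approximation rather than the library's actual code.

```python
import math
import re
from collections import Counter

docs = ['This is the first sentence',
        'This is the second sentence',
        'This is the third sentence']

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Term counts per document and the sorted vocabulary across all documents
counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {t: sum(t in c for c in counts) for t in vocab}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(term_counts):
    # Weight raw counts by idf, then scale the row to unit L2 length
    # (assumes the document contains at least one vocabulary term)
    raw = [term_counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

tfidf = [tfidf_row(c) for c in counts]

for row in tfidf:
    print([round(w, 4) for w in row])
```

Words that occur in every sentence ("this", "is", "the", "sentence") get the minimum idf of 1, so the one distinguishing word in each sentence ends up with the largest weight in its row.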

The output of fit_transform is a sparse matrix. To inspect it, convert it to a dense array with x.toarray():

In [44]: x.toarray()
Out[44]: 
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])
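Since the question asks for the result in pandas, one possible way to label the columns is to wrap the dense array in a DataFrame. This is only a sketch: tfidf_df and terms are names chosen here, and the column labels are recovered from the fitted vocabulary_ mapping, which should work across scikit-learn versions (newer releases also offer get_feature_names_out()). Calling toarray() is fine for three sentences but may use a lot of memory on a large corpus.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

# vocabulary_ maps each term to its column index; order the terms by that index
terms = sorted(v.vocabulary_, key=v.vocabulary_.get)

# One row per document (indexed by docId), one column per term
tfidf_df = pd.DataFrame(x.toarray(), columns=terms, index=df['docId'])
print(tfidf_df.round(3))
```

Each column of tfidf_df corresponds to one column of the array above, so the 0.646 entries line up with "first", "second" and "third".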


Source: https://stackoverflow.com/questions/37593293/what-is-the-simplest-way-to-get-tfidf-with-pandas-dataframe
