NLTK: How to create a corpus from csv file

Submitted by 烈酒焚心 on 2020-02-02 15:09:28

Question


I have a csv file as

col1         col2      col3

some text    someID    some value
some text    someID    some value

In each row, col1 contains the text of an entire document. I would like to create a corpus from this CSV; my aim is to use sklearn's TfidfVectorizer to compute document similarity and extract keywords. So consider:

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')  # tokenize: a custom tokenizer function defined elsewhere
tfs = tfidf.fit_transform(<my corpus here>)

so that I can then use

new_doc = 'here is some text from a new document'
response = tfidf.transform([new_doc])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])

How do I create a corpus using NLTK? What form or data structure should the corpus take so that it can be supplied to the fit_transform function?


Answer 1:


Check out read_csv from the pandas library. Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

You can install pandas by running pip install pandas at the command line. Then loading the CSV and selecting that column is as easy as this:

import pandas as pd

# Load the CSV and use the document-text column as the corpus
data = pd.read_csv(path_to_csv)
docs = data['col1']

tfs = tfidf.fit_transform(docs)
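
For completeness, here is a minimal end-to-end sketch of the whole pipeline under a few assumptions: the CSV is named docs.csv (a placeholder), col1 holds the document text as in the question, and the custom tokenize function is dropped in favour of the vectorizer's default tokenizer. Any iterable of strings (such as a pandas column) works as the corpus passed to fit_transform.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the CSV; col1 holds the full text of each document ('docs.csv' is a placeholder name)
data = pd.read_csv('docs.csv')
docs = data['col1']

# Fit TF-IDF on the corpus: any iterable of strings can serve as the corpus
tfidf = TfidfVectorizer(stop_words='english')
tfs = tfidf.fit_transform(docs)  # sparse matrix of shape (n_documents, n_terms)

# Document similarity: score a new document against every document in the corpus
new_doc = 'here is some text from a new document'
response = tfidf.transform([new_doc])
similarities = cosine_similarity(response, tfs)[0]  # one similarity value per corpus document
print(similarities)

# Keyword extraction: highest-weighted terms in the new document
# (get_feature_names_out is the scikit-learn >= 1.0 name; older versions use get_feature_names)
feature_names = tfidf.get_feature_names_out()
top = response.toarray()[0].argsort()[::-1][:5]
print([(feature_names[i], response[0, i]) for i in top])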


Source: https://stackoverflow.com/questions/34232047/nltk-how-to-create-a-corpus-from-csv-file
