Question
I have a csv file as
col1       col2    col3
some text  someID  some value
some text  someID  some value
In each row, col1 holds the text of an entire document. I would like to create a corpus from this csv. My aim is to use sklearn's TfidfVectorizer to compute document similarity and do keyword extraction. So consider
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(<my corpus here>)
so that I can then do
text = 'here is some text from a new document'  # renamed from `str` to avoid shadowing the built-in
response = tfidf.transform([text])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])
How do I create a corpus using NLTK? What form/data structure should the corpus take so that it can be supplied to the transform function?
Answer 1:
Check out read_csv from the pandas library. Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
You can install pandas by running pip install pandas at the command line. Loading the csv and selecting that column is then as simple as:
import pandas as pd

data = pd.read_csv(path_to_csv)
docs = data['col1']   # a pandas Series of strings works directly as the corpus
tfs = tfidf.fit_transform(docs)
Source: https://stackoverflow.com/questions/34232047/nltk-how-to-create-a-corpus-from-csv-file