What is an efficient data structure for tokenized data in Python?
Question: I have a pandas DataFrame with a column containing some text. I want to modify the DataFrame so that there is a column for every distinct word that occurs across all rows, and a boolean indicating whether or not that word occurs in that particular row's value of the text column. I have some code to do this:

```python
import numpy
from pandas import *

a = read_table('file.tsv', sep='\t', index_col=False)

# Count every distinct whitespace-separated token across all rows.
b = DataFrame(a['text'].str.split().tolist()).stack().value_counts()

# Add one all-zeros column per distinct token.
for i in b.index:
    a[i] = Series(numpy.zeros(len(a)))  # the snippet is cut off here in the original; this completion is assumed
```
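For reference, here is a minimal sketch of the transformation being described, assuming the goal is one boolean indicator column per distinct whitespace-separated token. It uses pandas' `Series.str.get_dummies` rather than the loop above, and the sample frame is a hypothetical stand-in for the real `file.tsv` data:

```python
import pandas as pd

# Hypothetical stand-in for read_table('file.tsv', ...).
a = pd.DataFrame({'text': ['the cat sat', 'the dog ran', 'cat and dog']})

# One 0/1 column per distinct token; split each value on a single space.
dummies = a['text'].str.get_dummies(sep=' ').astype(bool)

# Attach the indicator columns to the original frame.
result = pd.concat([a, dummies], axis=1)
print(result)
```

Note that `str.get_dummies(sep=' ')` splits only on single spaces, whereas `str.split()` splits on any whitespace, so the two approaches can differ on tab- or multi-space-separated text.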