Question:
I have a pandas dataframe that has a column with some text. I want to modify the dataframe such that there is a column for every distinct word that occurs across all rows, and a boolean indicating whether or not that word occurs in that particular row's value in my text column.
I have some code to do this:
import numpy as np
import pandas as pd

a = pd.read_table('file.tsv', sep='\t', index_col=False)
b = pd.DataFrame(a['text'].str.split().tolist()).stack().value_counts()
for i in b.index:
    a[i] = pd.Series(np.zeros(len(a.index)))
for i in b.index:
    for j in a.index:
        if i in a['text'][j].split():
            a[i][j] = 1
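For reference, pandas can build the same per-word 0/1 indicator columns without the explicit nested loops, using `Series.str.get_dummies`. A minimal sketch on a toy frame (the column name `'text'` follows the question; note this still produces a dense frame, so it does not by itself solve the memory problem for 70,000 words):

```python
import pandas as pd

# Toy frame standing in for the real 200,000-row data
a = pd.DataFrame({'text': ['cat on the cat', 'have a cat']})

# One 0/1 column per distinct whitespace-separated word
dummies = a['text'].str.get_dummies(sep=' ')

# Attach the indicator columns to the original frame
result = pd.concat([a, dummies], axis=1)
```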
However, my dataset is very large (200,000 rows and about 70,000 unique words). Is there a more efficient way to do this that won't destroy my computer?
Answer 1:
I would recommend using sklearn, specifically CountVectorizer.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(binary=True)
df = pd.DataFrame({'text': ['cat on the cat', 'angel eyes has', 'blue red angel',
                            'one two blue', 'blue whales eat', 'hot tin roof',
                            'angel eyes has', 'have a cat'],
                   'labels': [1, 0, 1, 1, 0, 0, 1, 1]})
X = vect.fit_transform(df['text'].values)
y = df['labels'].values
X
<8x16 sparse matrix of type '<type 'numpy.int64'>'
    with 23 stored elements in Compressed Sparse Row format>
This returns an m x n sparse matrix, where m is the number of rows in df and n is the number of distinct words. The sparse format saves memory when the majority of the matrix elements are 0. Leaving it sparse is the way to go, and many sklearn algorithms accept sparse input directly.
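For example, a classifier such as scikit-learn's LogisticRegression can be fit on the sparse matrix directly, with no dense conversion. A sketch reusing the toy data above (the choice of model is illustrative, not part of the original answer):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({'text': ['cat on the cat', 'angel eyes has', 'blue red angel',
                            'one two blue', 'blue whales eat', 'hot tin roof',
                            'angel eyes has', 'have a cat'],
                   'labels': [1, 0, 1, 1, 0, 0, 1, 1]})
vect = CountVectorizer(binary=True)
X = vect.fit_transform(df['text'].values)   # stays sparse (CSR format)
y = df['labels'].values

clf = LogisticRegression()
clf.fit(X, y)                               # sparse input is accepted as-is
preds = clf.predict(X)
```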
You can create a data frame from X (if really necessary, but it will be big):
word_counts = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
(In scikit-learn versions before 1.0 the method is get_feature_names, and X.todense() also works.)
Source: https://stackoverflow.com/questions/28667154/what-is-an-efficient-data-structure-for-tokenized-data-in-python