What is an efficient data structure for tokenized data in Python?

≯℡__Kan透↙ submitted on 2019-12-11 02:52:37

Question


I have a pandas dataframe that has a column with some text. I want to modify the dataframe such that there is a column for every distinct word that occurs across all rows, and a boolean indicating whether or not that word occurs in that particular row's value in my text column.

I have some code to do this:

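import numpy
from pandas import DataFrame, Series, read_table

a = read_table('file.tsv', sep='\t', index_col=False)
# Count every distinct word across all rows of the text column
b = DataFrame(a['text'].str.split().tolist()).stack().value_counts()

# Add one zero-filled column per distinct word
for i in b.index:
    a[i] = Series(numpy.zeros(len(a.index)))

# Set the flag to 1 where the word occurs in that row's text
for i in b.index:
    for j in a.index:
        if i in str.split(a['text'][j]):
            a.loc[j, i] = 1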

However, my dataset is very large (200,000 rows and about 70,000 unique words). Is there a more efficient way to do this that won't destroy my computer?


Answer 1:


I would recommend using sklearn, specifically CountVectorizer.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records presence (1) or absence (0) instead of counts
vect = CountVectorizer(binary=True)

df = pd.DataFrame({
    'text': ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
             'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
    'labels': [1, 0, 1, 1, 0, 0, 1, 1],
})

X = vect.fit_transform(df['text'].values)
y = df['labels'].values
X

<8x16 sparse matrix of type '<type 'numpy.int64'>'
with 23 stored elements in Compressed Sparse Row format>

This returns an m x n sparse matrix, where m is the number of rows in df and n is the number of distinct words (here 8 x 16; the default tokenizer lowercases the text and drops one-character tokens such as 'a'). The sparse format saves memory when most elements of the matrix are 0. Leaving it sparse seems the way to go, and many of the sklearn algorithms accept sparse input.
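As a rough illustration of passing the sparse matrix straight to an estimator, here is a minimal sketch. The choice of BernoulliNB and the test phrase 'blue cat' are assumptions made for the example, not part of the original answer; any scikit-learn estimator that accepts sparse input could be used the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB  # assumed estimator, chosen only for illustration

texts = ['cat on the cat', 'angel eyes has', 'blue red angel', 'one two blue',
         'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat']
labels = [1, 0, 1, 1, 0, 0, 1, 1]

vect = CountVectorizer(binary=True)
X = vect.fit_transform(texts)  # SciPy sparse matrix, one row per text

# Bytes held by the sparse representation versus a dense copy of the same matrix
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.toarray().nbytes

clf = BernoulliNB()
clf.fit(X, labels)  # the sparse matrix is passed directly, no densifying

# New text must go through the same fitted vectorizer
pred = clf.predict(vect.transform(['blue cat']))
```

On the dataset in the question (200,000 rows, about 70,000 words) the gap between sparse_bytes and dense_bytes would be far larger than on this toy input, since each row only touches a handful of columns.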

You can create a data frame from X (if really necessary, but it will be big):

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
word_counts = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())


Source: https://stackoverflow.com/questions/28667154/what-is-an-efficient-data-structure-for-tokenized-data-in-python
