Chunking, processing & merging dataset in Pandas/Python

Submitted by 拜拜、爱过 on 2019-12-10 17:42:43

Question


There is a large dataset containing strings. I just want to open it via read_fwf using widths, like this:

widths = [3, 7, ..., 9, 7]
tp = pandas.read_fwf(file, widths=widths, header=None)

This would help me mark the data, but the system crashes (it works with nrows=20000). So I decided to process it in chunks (e.g. 20000 rows), like this:

cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    <some code using chunk>

My question is: what should I do inside the loop to merge (concatenate?) the chunks back into a .csv file after some processing of each chunk (marking rows, dropping or modifying columns)? Or is there another way?


Answer 1:


I'm going to assume that since reading the entire file

tp = pandas.read_fwf(file, widths=widths, header=None)

fails but reading in chunks works, the file is too big to be read at once and you encountered a MemoryError.
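
As a quick illustration of why chunking sidesteps this: passing chunksize makes read_fwf return an iterator of DataFrames instead of one large frame, so only one chunk is held in memory at a time. A minimal sketch (the path and widths below are placeholders, not values from the question):

import pandas as pd

file = "data.txt"      # placeholder path to the fixed-width file
widths = [3, 7, 9, 7]  # placeholder column widths
cs = 20000             # rows per chunk

reader = pd.read_fwf(file, widths=widths, header=None, chunksize=cs)
for chunk in reader:
    # each chunk is an ordinary DataFrame with at most cs rows
    print(chunk.shape)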

In that case, if you can process the data in chunks, then to concatenate the results into a CSV you could use chunk.to_csv to write the CSV in chunks:

filename = ...
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk
    chunk.to_csv(filename, mode='a')

Note that mode='a' opens the file in append mode, so that the output of each chunk.to_csv call is appended to the same file.
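
Putting it together, here is a minimal end-to-end sketch. The output path, the processing placeholder, and the header/index flags are assumptions added for illustration, not part of the original answer; header=False on the appended chunks avoids repeating pandas' default integer column labels, and index=False keeps the row index out of the CSV:

import pandas as pd

file = "data.txt"          # placeholder input path
out_csv = "processed.csv"  # placeholder output path
widths = [3, 7, 9, 7]      # placeholder column widths
cs = 20000                 # rows per chunk

first = True
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # ... mark rows, drop or modify columns here ...
    chunk.to_csv(out_csv,
                 mode="w" if first else "a",  # overwrite on the first chunk, append afterwards
                 header=first,                # write the (integer) column labels only once
                 index=False)                 # keep the row index out of the CSV
    first = False

Writing the first chunk with mode='w' also ensures that a stale output file from a previous run is overwritten rather than appended to.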



Source: https://stackoverflow.com/questions/29907788/chunking-processing-merging-dataset-in-pandas-python
