Question
There is a large dataset containing strings. I just want to open it via read_fwf using widths, like this:
widths = [3, 7, ..., 9, 7]
tp = pandas.read_fwf(file, widths=widths, header=None)
This would help me mark the data, but the system crashes (it works with nrows=20000). So I decided to read it in chunks (e.g. 20000 rows), like this:
cs = 20000
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # <some code using chunk>
My question is: what should I do in the loop to merge (concatenate?) the chunks back into a .csv file after some processing of each chunk (marking rows, dropping or modifying columns)? Or is there another way?
Answer 1:
I'm going to assume that since reading the entire file
tp = pandas.read_fwf(file, widths=widths, header=None)
fails but reading in chunks works, the file is too big to be read at once and you encountered a MemoryError.
In that case, if you can process the data in chunks, you can concatenate the results into a CSV by using chunk.to_csv to write the CSV in chunks:
filename = ...
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # process the chunk
    chunk.to_csv(filename, mode='a')
Note that mode='a' opens the file in append mode, so the output of each chunk.to_csv call is appended to the same file.
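For reference, a minimal end-to-end sketch might look like the following. The file path, field widths, chunk size and output path are placeholders, and I'm assuming you don't want the default integer column labels or the row index repeated in the output (hence header=False, index=False); writing the first chunk with mode='w' also avoids appending to a leftover file from a previous run:
import pandas as pd

file = 'data.txt'        # placeholder input path
widths = [3, 7, 9, 7]    # placeholder field widths
cs = 20000               # chunk size that fits in memory
out_file = 'processed.csv'

first_chunk = True
for chunk in pd.read_fwf(file, widths=widths, header=None, chunksize=cs):
    # mark rows, drop or modify columns here
    chunk.to_csv(out_file,
                 mode='w' if first_chunk else 'a',  # overwrite once, then append
                 header=False, index=False)
    first_chunk = False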
Source: https://stackoverflow.com/questions/29907788/chunking-processing-merging-dataset-in-pandas-python