Pandas.read_csv() MemoryError

南楼画角 submitted on 2019-12-24 11:49:15

Question


I have a 1 GB CSV file with about 10,000,000 (10 million) rows. I need to iterate through the rows to get the maximum of a few selected rows (based on a condition). The issue is reading the CSV file.

I use the Pandas package for Python. The read_csv() function throws a MemoryError while reading the CSV file. I have tried splitting the file into chunks and reading them, but then the concat() function runs into a memory issue.

tp = pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                 dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                        'vdd_io_plat_i': float, 'vdd_io_plat_v': float, 'vdd_ext_flash_i': float,
                        'vdd_ext_flash_v': float, 'vsys_i vsys_v': float, 'vdd_aon_dig_i': float,
                        'vdd_aon_dig_v': float, 'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})

df = pd.concat(tp, ignore_index=True)

I have specified dtype to reduce memory usage, but there is still no improvement.

Based on multiple blog posts, I have updated numpy and pandas to their latest versions. Still no luck.

It would be great if anyone has a solution to this issue.

Please note:

  • I have a 64-bit operating system (Windows 7)

  • I am running Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit]

  • I have 4 GB of RAM.

  • numpy is the latest version (pip says the latest version is installed)

  • pandas is the latest version (pip says the latest version is installed)


Answer 1:


If the file you are trying to read is too large to fit in memory as a whole, you also cannot read it in chunks and then reassemble it in memory, because in the end that needs at least as much memory.

You could try reading the file in chunks, filtering out the unnecessary rows in each chunk (based on the condition you mention), and then reassembling the remaining rows into a DataFrame.

Which gives something like this:

df = pd.concat(
    (apply_your_filter(chunk_df) for chunk_df in
     pd.read_csv('capture2.csv', iterator=True, chunksize=10000,
                 dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,
                        'vdd_io_plat_i': float, 'vdd_io_plat_v': float, 'vdd_ext_flash_i': float,
                        'vdd_ext_flash_v': float, 'vsys_i vsys_v': float, 'vdd_aon_dig_i': float,
                        'vdd_aon_dig_v': float, 'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})),
    ignore_index=True)
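
Here apply_your_filter is only a placeholder name; the answer does not define it. A minimal sketch of what such a filter might look like, assuming for illustration that the condition is a threshold on one of the current columns (both the column choice and the 0.5 threshold are hypothetical):

def apply_your_filter(chunk_df):
    # Hypothetical condition: keep only the rows where this current reading
    # exceeds a threshold. Column name and threshold are illustrative assumptions.
    return chunk_df[chunk_df['vdd_io_soc_i'] > 0.5]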

And/or find the max of each chunk, then the max of those chunk maxes.
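
A minimal sketch of that chunk-wise approach, again assuming 'vdd_io_soc_i' is the column of interest:

import pandas as pd

chunk_maxes = []
for chunk in pd.read_csv('capture2.csv', chunksize=10000):
    # Keep only the maximum of the column of interest from each chunk,
    # so only one chunk is held in memory at a time.
    chunk_maxes.append(chunk['vdd_io_soc_i'].max())

overall_max = max(chunk_maxes)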




Answer 2:


Pandas read_csv() has a low_memory flag.

tp = pd.read_csv('capture2.csv', low_memory=True, ...)

The low_memory flag is only available if you use the C parser:

engine : {‘c’, ‘python’}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.

You can also use the memory_map flag:

memory_map : boolean, default False

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

source
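
For example, combining both flags with the chunked reading from the question (a sketch, not a guaranteed fix):

tp = pd.read_csv('capture2.csv', engine='c', low_memory=True,
                 memory_map=True, chunksize=10000)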


P.S. Use 64-bit Python - see my comment.




Answer 3:


Could you please check your Python version? You probably have a 32-bit build, which has memory limitations.
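
A quick way to check whether the interpreter itself is 32-bit or 64-bit (a minimal sketch):

import struct

# Prints 32 on a 32-bit build and 64 on a 64-bit build (pointer size in bits).
print(struct.calcsize("P") * 8)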

Try installing 64-bit Python and loading the data into pandas without concat, like:

df = pd.read_csv('/path/to/csv')


Source: https://stackoverflow.com/questions/42931068/pandas-read-csv-memoryerror
