I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas dataframe to a CSV file using the to_csv() function.
You said "[...] of mostly numeric (decimal) data." Do you have any columns with times and/or dates?
I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two date columns. So my recommendation is to convert each date column to a string before saving. The following command is enough:
df['Column'] = df['Column'].astype(str)
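If the dataframe has several datetime columns, you can convert them all in one pass before saving. A minimal sketch (the dataframe below is just a placeholder for your own df):

import pandas as pd

# Placeholder dataframe standing in for your own df
df = pd.DataFrame({
    'value': [1.5, 2.5],
    'when': pd.to_datetime(['2021-01-01', '2021-01-02']),
})

# Convert every datetime column (naive or tz-aware) to string before to_csv()
for col in df.select_dtypes(include=['datetime', 'datetimetz']).columns:
    df[col] = df[col].astype(str)

df.to_csv('output.csv', index=False)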
I hope that this answer helps you.
P.S.: I understand that saving as a .hdf file solved the problem, but sometimes we do need a .csv file anyway.
Adding my small insight, since the 'gzip' alternative did not work for me: try using the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file; the CSV option took between 30 and 55 seconds.)
stage.to_hdf(r'path/file.h5', key='stage', mode='w')
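Note that to_hdf needs the PyTables package (tables) installed. Reading the file back is a one-liner; a minimal sketch assuming the same path and key as above:

import pandas as pd

# Requires PyTables: pip install tables
# Read the dataframe back using the same key it was written with
stage = pd.read_hdf(r'path/file.h5', key='stage')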
You are reading compressed files and writing a plaintext file, which could be an I/O bottleneck. Writing a compressed file can speed up writing by up to 10x:
stage.to_csv('output.csv.gz',
             sep='|',
             header=True,
             index=False,
             chunksize=100000,
             compression='gzip',
             encoding='utf-8')
Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
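A rough way to compare the methods on your own data; the stage dataframe below is a random placeholder, so the timings are only illustrative:

import time
import numpy as np
import pandas as pd

# Placeholder dataframe standing in for 'stage'; swap in your real data
stage = pd.DataFrame(np.random.rand(200_000, 5), columns=list('abcde'))

# Time each compression method with the same chunk size
for compression in ['gzip', 'bz2', 'xz']:
    start = time.perf_counter()
    stage.to_csv(f'output.csv.{compression}', sep='|', index=False,
                 chunksize=100000, compression=compression)
    print(f'{compression}: {time.perf_counter() - start:.1f}s')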