I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas dataframe to a CSV file using the to_csv() function.
You said "[...] of mostly numeric (decimal) data." Do you have any columns with times and/or dates?
I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two date columns. So my recommendation is to convert each date column to a string before saving. The following command is enough:
df['Column'] = df['Column'].astype(str)
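If the dataframe has several datetime columns, you can convert them all in one pass before saving. A minimal sketch (the dataframe below is just a placeholder for your own df):

import pandas as pd

# Placeholder dataframe standing in for your own df
df = pd.DataFrame({
    'value': [1.5, 2.5],
    'when': pd.to_datetime(['2021-01-01', '2021-01-02']),
})

# Convert every datetime column (naive or tz-aware) to string before to_csv()
for col in df.select_dtypes(include=['datetime', 'datetimetz']).columns:
    df[col] = df[col].astype(str)

df.to_csv('output.csv', index=False)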
I hope that this answer helps you.
P.S.: I understand that saving as a .hdf file solved the problem, but sometimes we do need a .csv file anyway.
Adding my small insight, since the 'gzip' alternative did not work for me: try using the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file; the CSV option took between 30 and 55 seconds.)
stage.to_hdf(r'path/file.h5', key='stage', mode='w')
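Note that to_hdf needs the PyTables package (tables) installed. Reading the file back is a one-liner; a minimal sketch assuming the same path and key as above:

import pandas as pd

# Requires PyTables: pip install tables
# Read the dataframe back using the same key it was written with
stage = pd.read_hdf(r'path/file.h5', key='stage')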
You are reading compressed files and writing a plaintext file, which could be an I/O bottleneck. Writing a compressed file can speed up writing by up to 10x:
stage.to_csv('output.csv.gz',
             sep='|',
             header=True,
             index=False,
             chunksize=100000,
             compression='gzip',
             encoding='utf-8')
Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
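A rough way to compare the methods on your own data; the stage dataframe below is a random placeholder, so the timings are only illustrative:

import time
import numpy as np
import pandas as pd

# Placeholder dataframe standing in for 'stage'; swap in your real data
stage = pd.DataFrame(np.random.rand(200_000, 5), columns=list('abcde'))

# Time each compression method with the same chunk size
for compression in ['gzip', 'bz2', 'xz']:
    start = time.perf_counter()
    stage.to_csv(f'output.csv.{compression}', sep='|', index=False,
                 chunksize=100000, compression=compression)
    print(f'{compression}: {time.perf_counter() - start:.1f}s')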