Pandas to_csv() slow saving large dataframe

攒了一身酷 2020-12-17 10:35

I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas dataframe to a CSV file using the to_csv() method.

3 Answers
  • 2020-12-17 11:02

    You said "[...] of mostly numeric (decimal) data." Do you have any columns with times and/or dates?

    I have saved an 8 GB CSV in seconds when it contained only numeric/string values, but saving a 500 MB CSV with two date columns took 20 minutes. So I would recommend converting each date column to a string before saving. The following command is enough:

    df['Column'] = df['Column'].astype(str) 
    

    I hope that this answer helps you.

    P.S.: I understand that saving as a .hdf file solved the problem, but sometimes we do need a .csv file anyway.
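
    To apply that idea to every date/time column at once, here is a minimal sketch; the dataframe below is a made-up stand-in for your own data:

        import pandas as pd

        # Hypothetical example data; replace with your own dataframe.
        df = pd.DataFrame({
            'value': [1.5, 2.5],
            'when': pd.to_datetime(['2020-01-01', '2020-06-15']),
        })

        # Convert every datetime column to string so to_csv() does not
        # have to format each timestamp itself.
        for col in df.select_dtypes(include=['datetime64[ns]']).columns:
            df[col] = df[col].astype(str)

        df.to_csv('output.csv', index=False)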

  • 2020-12-17 11:08

    Adding my small insight, since the 'gzip' alternative did not work for me: try the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file, whereas the CSV option took between 30 and 55 seconds.)

    stage.to_hdf(r'path/file.h5', key='stage', mode='w')
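
    For completeness: to_hdf requires the PyTables package ("tables") to be installed, and the file can be read back with pd.read_hdf. A minimal sketch, assuming the same path and key as above:

        import pandas as pd

        # Read the dataframe back from the HDF5 store written above.
        stage = pd.read_hdf(r'path/file.h5', key='stage')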
    
  • 2020-12-17 11:13

    You are reading compressed files and writing a plaintext file; this could be an I/O bottleneck.

    Writing a compressed file can speed up writing by up to 10x:

        # Write gzip-compressed output; chunksize controls how many rows
        # are written per batch.
        stage.to_csv('output.csv.gz',
                     sep='|',
                     header=True,
                     index=False,
                     chunksize=100000,
                     compression='gzip',
                     encoding='utf-8')
    

    Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz'), as in the sketch below.
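
    A minimal benchmarking sketch for comparing compression methods; the random dataframe here is a hypothetical stand-in for the real 'stage' data:

        import time

        import numpy as np
        import pandas as pd

        # Hypothetical stand-in for the real dataframe.
        stage = pd.DataFrame(np.random.rand(100_000, 10))

        for comp in [None, 'gzip', 'bz2', 'xz']:
            name = comp or 'plain'
            start = time.perf_counter()
            # compression=None writes plaintext; an explicit setting
            # overrides whatever the file extension suggests.
            stage.to_csv(f'output.{name}.csv', sep='|', index=False,
                         chunksize=100000, compression=comp)
            print(f'{name}: {time.perf_counter() - start:.1f} s')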
