What is the fastest way to output large DataFrame into a CSV file?

北海茫月 2020-12-01 14:19

For Python / pandas I find that df.to_csv(fname) works at a speed of ~1 million rows per minute. I can sometimes improve performance by a factor of 7 like this:

def df2csv(...):
    ...

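The function body above is truncated; below is a minimal sketch of such a hand-rolled formatter, reconstructed from the df2csv(aa, 'code_from_question', myformats=['%d','%.1f','%.1f','%.1f']) call used in the answers. The exact signature, defaults, and dtype handling here are assumptions, not the original code:

import numpy as np

def df2csv(df, fname, myformats=[], sep=','):
    # Write df to fname using an explicit %-format string per column.
    # Much faster than to_csv, but does no quoting or escaping of values.
    formats = list(myformats)
    # Pick a default format for any column that was not given one.
    for coltype in df.dtypes.iloc[len(formats):]:
        if coltype == np.int64:
            formats.append('%d')
        elif coltype == np.float64:
            formats.append('%f')
        else:
            formats.append('%s')
    with open(fname, 'w') as fh:
        fh.write(sep.join(df.columns) + '\n')
        for row in df.itertuples(index=False):
            fh.write(sep.join(f % v for f, v in zip(formats, row)) + '\n')
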
4 Answers
  • 2020-12-01 14:24

    In 2019 for cases like this, it may be better to just use numpy. Look at the timings:

    aa.to_csv('pandas_to_csv', index=False)
    # 6.47 s
    
    df2csv(aa,'code_from_question', myformats=['%d','%.1f','%.1f','%.1f'])
    # 4.59 s
    
    from numpy import savetxt
    
    savetxt(
        'numpy_savetxt', aa.values, fmt='%d,%.1f,%.1f,%.1f',
        header=','.join(aa.columns), comments=''
    )
    # 3.5 s
    

    So you can cut the time by a factor of two using numpy. This, of course, comes at a cost of reduced flexibility (when compared to aa.to_csv).

    Benchmarked with Python 3.7, pandas 0.23.4, numpy 1.15.2 (xrange was replaced by range to make the posted function from the question work in Python 3).

    PS. If you need to include the index, savetxt will work fine - just pass df.reset_index().values and adjust the formatting string accordingly.
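    A rough illustration, assuming the same four numeric columns as above plus the default integer index (the output file name and the extra '%d' for the index column are assumptions):

    # reset_index() turns the unnamed RangeIndex into a regular first column called 'index'
    savetxt(
        'numpy_savetxt_with_index', aa.reset_index().values,
        fmt='%d,%d,%.1f,%.1f,%.1f',
        header=','.join(['index'] + list(aa.columns)), comments=''
    )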

  • 2020-12-01 14:29

    Lev. Pandas has rewritten to_csv to make a big improvement in native speed. The process is now I/O bound, and it accounts for many subtle dtype and quoting cases. Here are our performance results vs. 0.10.1 (in the upcoming 0.11 release). Times are in ms; a lower ratio is better.

    Results:
                                                t_head  t_baseline      ratio
    name                                                                     
    frame_to_csv2 (100k rows)                 190.5260   2244.4260     0.0849
    write_csv_standard  (10k rows)             38.1940    234.2570     0.1630
    frame_to_csv_mixed  (10k rows, mixed)     369.0670   1123.0412     0.3286
    frame_to_csv (3k rows, wide)              112.2720    226.7549     0.4951
    

    So throughput for a single dtype (e.g. floats) that is not too wide is about 20M rows / min. Here is your example from above:

    In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype='float64')}) 
    In [13]: df['B'] = df['A'] + 1.0   
    In [14]: df['C'] = df['A'] + 2.0
    In [15]: df['D'] = df['A'] + 2.0
    In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')
    1 loops, best of 1: 119 s per loop
    
  • 2020-12-01 14:33

    Use chunksize. I have found that it makes a huge difference. If you have the memory available, pick a good chunksize (number of rows) that fits into memory and then write in those larger batches, as in the sketch below.
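    A minimal sketch of that idea (the output file name and the one-million-row chunk size are illustrative assumptions):

    # Ask to_csv to format and write the frame in 1,000,000-row batches
    # instead of the smaller default chunks.
    df.to_csv('out.csv', index=False, chunksize=1_000_000)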

  • 2020-12-01 14:40

    Your df2csv function is very nice, except it makes a lot of assumptions and doesn't work for the general case.

    If it works for you, that's good, but be aware that it is not a general solution. CSV values can contain commas, so what happens if this tuple needs to be written: ('a,b', 'c')?

    The Python csv module would quote that value so that no confusion arises, and it would escape quotes if quotes are present in any of the values. Of course, generating something that works in all cases is much slower. But I suppose you only have a bunch of numbers.
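    A small illustration of that quoting behaviour with the standard library (writing to stdout purely for demonstration):

    import csv, sys

    w = csv.writer(sys.stdout)
    w.writerow(('a,b', 'c'))      # prints "a,b",c   -- the embedded comma is protected by quotes
    w.writerow(('say "hi"', 1))   # prints "say ""hi""",1   -- embedded quotes are doubled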

    You could try this and see if it is faster:

    # data is a sequence of tuples of integers; f is an already-open file object
    for row in data:
        for col in range(len(row)):   # range, not xrange, on Python 3
            f.write('%d' % row[col])
            if col < len(row) - 1:
                f.write(',')
        f.write('\n')
    

    I don't know if that would be faster. If not, it is because too many system calls are made, so you might write to a StringIO buffer instead of directly to the file and dump it to the real file every once in a while, as sketched below.
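    A minimal sketch of that buffering idea (the output file name, the '%d' format, and the flush threshold are arbitrary assumptions; data is the same sequence of tuples as above):

    import io

    buf = io.StringIO()
    with open('out.csv', 'w') as f:
        for i, row in enumerate(data, 1):
            buf.write(','.join('%d' % v for v in row) + '\n')
            if i % 100000 == 0:          # flush the in-memory buffer to disk periodically
                f.write(buf.getvalue())
                buf = io.StringIO()
        f.write(buf.getvalue())          # write whatever is left in the buffer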
