Writing Dask partitions into a single file

孤城傲影 2020-12-29 04:04

New to Dask. I have a 1 GB CSV file; when I read it into a Dask dataframe, it creates around 50 partitions. After my changes in the file, when I

2 Answers
  • 2020-12-29 04:37

    Short answer

    No, by default dask.dataframe.to_csv writes one CSV file per partition rather than a single file. However, there are ways around this.

    Concatenate Afterwards

    Perhaps just concatenate the files after dask.dataframe writes them? This is likely to be near-optimal in terms of performance.

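    from glob import glob

    df.to_csv('/path/to/myfiles.*.csv')
    # glob() returns files in arbitrary order, so sort to keep partition order
    filenames = sorted(glob('/path/to/myfiles.*.csv'))
    with open('outfile.csv', 'w') as out:
        for i, fn in enumerate(filenames):
            with open(fn) as f:
                if i > 0:
                    next(f, None)  # skip the header row repeated in each partition file
                for line in f:  # copy line by line instead of reading the whole file
                    out.write(line)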
    

    Or use Dask.delayed

    Alternatively, you can do this yourself by using dask.delayed alongside dataframes.

    This gives you a list of delayed values that you can use however you like:

    list_of_delayed_values = df.to_delayed()
    

    It's then up to you to structure a computation to write these partitions sequentially to a single file. This isn't hard to do, but can cause a bit of backup on the scheduler.
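One way to structure that sequential write is sketched below. This is a minimal illustration, not an official Dask recipe: the helper name write_partitions_sequentially is hypothetical, and it assumes each delayed value computes to a pandas DataFrame, as the values returned by df.to_delayed() do. It also passes index=False, which is a choice made for the sketch rather than something the answer specifies.

```python
def write_partitions_sequentially(delayed_parts, out_path):
    """Compute each partition in turn and append it to a single CSV file.

    Hypothetical helper: `delayed_parts` is assumed to be an iterable of
    objects whose .compute() returns a pandas DataFrame.
    """
    rows_written = 0
    # newline="" lets the CSV writer control line endings itself.
    with open(out_path, "w", newline="") as out:
        for i, part in enumerate(delayed_parts):
            pdf = part.compute()  # only this one partition is held in memory
            # Write the header for the first partition only.
            pdf.to_csv(out, header=(i == 0), index=False)
            rows_written += len(pdf)
    return rows_written
```

With a Dask dataframe df, this would be called as write_partitions_sequentially(df.to_delayed(), 'outfile.csv'). Partitions are computed one at a time, so peak memory stays around the size of a single partition, at the cost of giving up parallelism.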

    Edit 1: (On October 23, 2019)

    In Dask 2.6.x, to_csv has a single_file parameter. It defaults to False; set it to True to get a single output file without calling df.compute().

    For example:

    df.to_csv('/path/to/myfiles.csv', single_file=True)
    

    Reference: Documentation for to_csv

  • 2020-12-29 04:43

    You can convert your Dask dataframe to a pandas dataframe with compute() and then call to_csv on the result. Note that this loads the entire dataset into memory. Something like this:

    df_dask.compute().to_csv('csv_path_file.csv')
