Question
Here is a summary of what I'm doing:
At first, I did this with plain multiprocessing and the pandas package:
Step 1. Get the list of file names I'm going to read
import os
files = os.listdir(DATA_PATH + product)
Step 2. Loop over the list
from multiprocessing import Pool
import pandas as pd
def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe
    data = pd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False)
    ### Step 2.2 do some calculation
    ### .......
    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv("another folder/" + file)

if __name__ == '__main__':
    cl = Pool(4)
    cl.map(readAndWriteCsvFiles, files, chunksize=1)
    cl.close()
    cl.join()
The code works fine, but it's very slow.
It needs about 1000 seconds to do the task. For comparison, an R program using library(parallel) and the parSapply function only takes about 160 seconds.
So then I tried dask.delayed and dask.dataframe with the following code:
Step 1. Get the list of file names I'm going to read
import os
files = os.listdir(DATA_PATH + product)
Step 2. Loop over the list
from dask.delayed import delayed
import dask.dataframe as dd
from dask import compute
def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe
    data = dd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False, assume_missing=True)
    ### Step 2.2 do some calculation
    ### .......
    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv(filename="another folder/*", name_function=lambda x: file)

compute([delayed(readAndWriteCsvFiles)(file) for file in files])
This time, I found that if I commented out step 2.3 in both the dask code and the pandas code, dask would run much faster than plain pandas with multiprocessing.
But as soon as I invoke the to_csv method, dask is as slow as pandas.
Any solution?
Thanks
Answer 1:
Reading and writing CSV files is often bound by the GIL. You might want to try parallelizing with processes rather than with threads (the default for dask delayed).
You can achieve this by adding the scheduler='processes' keyword to your compute call.
compute([delayed(readAndWriteCsvFiles)(file) for file in files], scheduler='processes')
See the scheduling documentation for more information.
Also, note that you're not using dask.dataframe here, but rather dask.delayed.
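Putting the two points together, a minimal sketch (assuming the directory layout from the question; the paths and the calculation step are placeholders) would keep plain pandas inside the delayed function and let the process scheduler handle the parallelism:

import os
import pandas as pd
from dask import delayed, compute

DATA_PATH = "data/"           # placeholder for the question's DATA_PATH
product = "product"           # placeholder product folder
OUT_PATH = "another folder/"  # output folder from the question

def readAndWriteCsvFiles(file):
    # read a single csv with plain pandas -- one file per task
    data = pd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True)
    # ... calculation step goes here ...
    # write the result to the output folder, keeping the original file name
    data.to_csv(OUT_PATH + file)

if __name__ == '__main__':
    files = os.listdir(DATA_PATH + product)
    tasks = [delayed(readAndWriteCsvFiles)(file) for file in files]
    # 'processes' avoids the GIL bottleneck of CSV parsing and writing
    compute(*tasks, scheduler='processes')

Since each task reads and writes one file independently, dask.delayed with the process scheduler is enough here; dask.dataframe isn't needed for this pattern.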
Source: https://stackoverflow.com/questions/52342245/how-should-i-write-multiple-csv-files-efficiently-using-dask-dataframe