Question
Here is a summary of what I'm doing:
At first, I did this with plain multiprocessing and the pandas package:
Step 1. Get the list of file names I'm going to read
import os
files = os.listdir(DATA_PATH + product)
Step 2. Loop over the list
from multiprocessing import Pool
import pandas as pd
def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe
    data = pd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False)
    ### Step 2.2 do some calculation
    ### .......
    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv("another folder/" + file)

if __name__ == '__main__':
    cl = Pool(4)
    cl.map(readAndWriteCsvFiles, files, chunksize=1)
    cl.close()
    cl.join()
The code works fine, but it's very slow.
It needs about 1000 seconds to do the task. For comparison, an R program using library(parallel) and the parSapply function only takes about 160 seconds.
So then I tried dask.delayed and dask.dataframe with the following code:
Step 1. Get the list of file names I'm going to read
import os
files = os.listdir(DATA_PATH + product)
Step 2. Loop over the list
from dask.delayed import delayed
import dask.dataframe as dd
from dask import compute
def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe
    data = dd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False, assume_missing=True)
    ### Step 2.2 do some calculation
    ### .......
    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv(filename="another folder/*", name_function=lambda x: file)

compute([delayed(readAndWriteCsvFiles)(file) for file in files])
This time, I found that if I commented out step 2.3 in both the dask code and the pandas code, dask would run much faster than plain pandas with multiprocessing.
But as soon as I invoke the to_csv method, dask is as slow as pandas.
Any solution?
Thanks
Answer 1:
Reading and writing CSV files is often bound by the GIL. You might want to try parallelizing with processes rather than with threads (the default for dask delayed).
You can achieve this by adding the scheduler='processes' keyword to your compute call.
compute([delayed(readAndWriteCsvFiles)(file) for file in files], scheduler='processes')
See the scheduling documentation for more information.
Also, note that you're not using dask.dataframe here, but rather dask.delayed.
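Putting the two points together, a minimal sketch (assuming the directory layout from the question; the paths and the calculation step are placeholders) would keep plain pandas inside the delayed function and let the process scheduler handle the parallelism:

import os
import pandas as pd
from dask import delayed, compute

DATA_PATH = "data/"           # placeholder for the question's DATA_PATH
product = "product"           # placeholder product folder
OUT_PATH = "another folder/"  # output folder from the question

def readAndWriteCsvFiles(file):
    # read a single csv with plain pandas -- one file per task
    data = pd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True)
    # ... calculation step goes here ...
    # write the result to the output folder, keeping the original file name
    data.to_csv(OUT_PATH + file)

if __name__ == '__main__':
    files = os.listdir(DATA_PATH + product)
    tasks = [delayed(readAndWriteCsvFiles)(file) for file in files]
    # 'processes' avoids the GIL bottleneck of CSV parsing and writing
    compute(*tasks, scheduler='processes')

Since each task reads and writes one file independently, dask.delayed with the process scheduler is enough here; dask.dataframe isn't needed for this pattern.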
Source: https://stackoverflow.com/questions/52342245/how-should-i-write-multiple-csv-files-efficiently-using-dask-dataframe