Question
I'm looking for the best way to parallelize the following problem on a cluster. I have several files:
- folder/file001.csv
- folder/file002.csv
- ...
- folder/file100.csv
The files are disjoint with respect to the key I want to group by: if a set of keys appears in file001.csv, none of those keys has an item in any other file.
On one hand, I can simply run:

import dask.dataframe as dd

df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do this, in a sort of delayed-groupby fashion.
Every filexxx.csv fits in memory on a node. Given that every node has n cores, it would be ideal to use all of them. For a single file I can use this hacky approach:
import numpy as np
import pandas as pd
import multiprocessing as mp

cores = mp.cpu_count()  # number of CPU cores on this node
partitions = cores      # define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

data = parallelize(data, f)
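Since the keys are disjoint across files, the multiprocessing helper above can be replaced by a per-file groupby expressed with dask.delayed, which the scheduler then runs across the cluster. A minimal sketch, with a hypothetical per-group function `f` and small generated files standing in for the real folder:

```python
import os
import tempfile

import dask
import pandas as pd

# Hypothetical stand-in for the per-group function f
def f(group):
    return group["value"].sum()

@dask.delayed
def process_file(path):
    # Keys are disjoint across files, so each file can be grouped
    # independently; no cluster-wide shuffle is needed.
    df = pd.read_csv(path)
    return df.groupby("key").apply(f)

# Illustrative setup: three small files with disjoint key sets
tmp = tempfile.mkdtemp()
for i, keys in enumerate([("a", "b"), ("c",), ("d", "e")]):
    rows = pd.DataFrame({"key": list(keys) * 2, "value": 1})
    rows.to_csv(os.path.join(tmp, f"file{i:03d}.csv"), index=False)

paths = sorted(os.path.join(tmp, p) for p in os.listdir(tmp))
parts = dask.compute(*[process_file(p) for p in paths])
result = pd.concat(parts)
```

Each delayed task stays within one node's memory, matching the "every file fits in memory on a node" constraint.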
And, again, I'm not sure if there is an efficient dask way to do this.
Answer 1:
You could use a Client (it runs in multi-process mode by default) and read your data with a given blocksize. You can get the number of workers (and the number of cores per worker) with the ncores method and then calculate an optimal blocksize.
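For illustration, here is one way to turn the ncores output into a blocksize. The sizing heuristic (a fraction of per-core memory) and the worker addresses are made-up assumptions, not Dask's own formula:

```python
def suggested_blocksize(ncores, total_memory_bytes, fraction=0.1):
    # Hypothetical heuristic: cap each partition at a fraction of
    # the memory available per core. Illustrative only.
    total_cores = sum(ncores.values())
    per_core_memory = total_memory_bytes / total_cores
    return int(per_core_memory * fraction)

# On a live cluster: workers = client.ncores()  -> {address: n_threads}
workers = {"tcp://10.0.0.1:33221": 8, "tcp://10.0.0.2:33221": 8}
blocksize = suggested_blocksize(workers, 64 * 2**30)  # 64 GiB cluster RAM
# The result can be passed as dd.read_csv("folder/*", blocksize=blocksize)
```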
However, according to the documentation, blocksize is by default "computed based on available physical memory and the number of cores", so I think the best way to do it is simply:
from distributed import Client
import dask.dataframe as dd

# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")
EDIT: after that, use map_partitions and do the groupby within each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
Don't use compute, because it will gather everything into a single pandas DataFrame; instead, use a dask output method (for example to_parquet or to_csv) to keep the entire process parallel and larger-than-RAM compatible.
Source: https://stackoverflow.com/questions/51402772/parallelization-on-cluster-dask