Question
I'm looking for the best way to parallelize the following problem on a cluster. I have several files:
- folder/file001.csv
- folder/file002.csv
- ...
- folder/file100.csv
The files are disjoint with respect to the key I want to group by: if a set of keys appears in file001.csv, none of those keys has an item in any other file.
On one hand, I can simply run:

import dask.dataframe as dd

df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do this, in a sort of delayed-groupby fashion.
Every filexxx.csv fits in memory on a node. Given that every node has n cores, it would be ideal to use all of them. For a single file I can use this hacky approach:
import numpy as np
import pandas as pd
import multiprocessing as mp

cores = mp.cpu_count()  # number of CPU cores on this node
partitions = cores      # define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

data = parallelize(data, f)
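Since the keys are disjoint across files, the multiprocessing helper above can be replaced by a per-file groupby expressed with dask.delayed, which the scheduler then runs across the cluster. A minimal sketch, with a hypothetical per-group function `f` and small generated files standing in for the real folder:

```python
import os
import tempfile

import dask
import pandas as pd

# Hypothetical stand-in for the per-group function f
def f(group):
    return group["value"].sum()

@dask.delayed
def process_file(path):
    # Keys are disjoint across files, so each file can be grouped
    # independently; no cluster-wide shuffle is needed.
    df = pd.read_csv(path)
    return df.groupby("key").apply(f)

# Illustrative setup: three small files with disjoint key sets
tmp = tempfile.mkdtemp()
for i, keys in enumerate([("a", "b"), ("c",), ("d", "e")]):
    rows = pd.DataFrame({"key": list(keys) * 2, "value": 1})
    rows.to_csv(os.path.join(tmp, f"file{i:03d}.csv"), index=False)

paths = sorted(os.path.join(tmp, p) for p in os.listdir(tmp))
parts = dask.compute(*[process_file(p) for p in paths])
result = pd.concat(parts)
```

Each delayed task stays within one node's memory, matching the "every file fits in memory on a node" constraint.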
And, again, I'm not sure if there is an efficient dask way to do this.
Answer 1:
You could use a Client (it runs in multi-process mode by default) and read your data with a given blocksize. You can get the number of workers (and the number of cores per worker) with the ncores method and then calculate an optimal blocksize.
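For illustration, here is one way to turn the ncores output into a blocksize. The sizing heuristic (a fraction of per-core memory) and the worker addresses are made-up assumptions, not Dask's own formula:

```python
def suggested_blocksize(ncores, total_memory_bytes, fraction=0.1):
    # Hypothetical heuristic: cap each partition at a fraction of
    # the memory available per core. Illustrative only.
    total_cores = sum(ncores.values())
    per_core_memory = total_memory_bytes / total_cores
    return int(per_core_memory * fraction)

# On a live cluster: workers = client.ncores()  -> {address: n_threads}
workers = {"tcp://10.0.0.1:33221": 8, "tcp://10.0.0.2:33221": 8}
blocksize = suggested_blocksize(workers, 64 * 2**30)  # 64 GiB cluster RAM
# The result can be passed as dd.read_csv("folder/*", blocksize=blocksize)
```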
However, according to the documentation, blocksize is by default "computed based on available physical memory and the number of cores", so I think the best way to do it is simply:
from distributed import Client
import dask.dataframe as dd

# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')
ddf = dd.read_csv("folder/*")
EDIT: after that, use map_partitions and do the groupby within each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
Don't use compute, because it will gather everything into a single pandas DataFrame; instead, use a dask output method (for example to_parquet or to_csv) to keep the entire process parallel and larger-than-RAM compatible.
Source: https://stackoverflow.com/questions/51402772/parallelization-on-cluster-dask