Question
I am interested in working with persistent distributed dataflows that have features similar to those of the Pegasus project (https://pegasus.isi.edu/), for example. Do you think there is a way to do that with dask?
I tried to implement something that works with a SLURM cluster and dask. Below, I describe my solution in broad strokes to better specify my use case.
The idea is to execute medium-sized tasks (that run from a few minutes to hours), specified as a graph that is persistent and can easily be extended. I implemented something based on dask's scheduler and its graph API. To get persistence, I wrote two kinds of decorators (sketched after this list):
- one "memoize" decorator that permits to serialize in a customizable way complexe arguments, and also the results, of the functions (a little bit like dask do with cachey or chest, or like spark does with its RDD objects) and
- one "delayed" decorator that permits to execute functions on a cluster (SLURM). In practice the API of functions is modified in order that they take jobids of dependencies as arguments and return the jobid of the created job on the cluster. Also the functions are serialized in a text file "launch.py" wich is launched with the cluster's command line API.
The taskname-jobid association is saved in a JSON file, which makes it possible to manage persistence using the task statuses returned by the cluster. Working this way gives a kind of persistence of the graph and makes it easy to debug tasks that failed. The serialization mechanism also makes it easy to access all intermediate results, even without the whole workflow and/or the functions that generated them. Finally, it makes it easy to interact with legacy applications that do not use this kind of dataflow mechanism.
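
And here is a minimal sketch of the second decorator together with the JSON registry. It assumes jobs are submitted with `sbatch` and their status is queried with `sacct`; the `launch_<name>.py` script is only a placeholder for the real serialized function, and names such as `REGISTRY` and `job_state` are hypothetical.

```python
import json
import subprocess
from pathlib import Path

REGISTRY = Path("jobs.json")  # hypothetical taskname -> jobid store


def _load_registry():
    return json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}


def job_state(jobid):
    # Ask SLURM's accounting for the job's state (e.g. COMPLETED, FAILED).
    out = subprocess.run(
        ["sacct", "-j", jobid, "-X", "--format=State", "--noheader"],
        capture_output=True, text=True).stdout.strip()
    return out.split()[0] if out else "UNKNOWN"


def delayed(func):
    # Submit `func` as a SLURM job: dependency jobids in, new jobid out.
    def wrapper(*dep_jobids):
        registry = _load_registry()
        name = func.__name__
        # Persistence: if this task already ran (or is running), reuse it.
        if name in registry and job_state(registry[name]) in (
                "COMPLETED", "RUNNING", "PENDING"):
            return registry[name]
        # Placeholder for the serialized function ("launch.py").
        script = Path(f"launch_{name}.py")
        script.write_text(f"# serialized call to {name}() would go here\n")
        cmd = ["sbatch", "--parsable", f"--wrap=python {script}"]
        if dep_jobids:  # only start once all dependencies have succeeded
            cmd.insert(1, "--dependency=afterok:" + ":".join(dep_jobids))
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        jobid = out.stdout.strip().split(";")[0]  # --parsable prints the jobid
        registry[name] = jobid
        REGISTRY.write_text(json.dumps(registry, indent=2))
        return jobid
    return wrapper
```

Chaining then looks like `delayed(step_b)(delayed(step_a)())`: each call returns a jobid, and re-running the driver script skips every task whose job already completed.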
This solution is certainly a little naive compared to other, more modern ways to execute distributed workflows with dask and distributed, but it seems to me to have some advantages due to its persistence capabilities (of both tasks and data).
I'm interested to know whether this solution seems pertinent, and whether it describes an interesting use case not yet addressed by dask.
If someone can recommend other ways to do this, I am also interested!
Source: https://stackoverflow.com/questions/42988110/persistent-dataflows-with-dask