Persistent dataflows with dask


Question


I am interested in working with persistent distributed dataflows, with features similar to those of the Pegasus project (https://pegasus.isi.edu/), for example. Do you think there is a way to do that with dask?

I tried to implement something that works with a SLURM cluster and dask. Below I describe my solution in broad strokes in order to better specify my use case.

The idea is to execute medium-sized tasks (each running between a few minutes and a few hours) that are specified as a graph which can be persisted and easily extended. I implemented something based on dask's scheduler and its graph API. To get persistency, I wrote two kinds of decorators:

  • one "memoize" decorator that permits to serialize in a customizable way complexe arguments, and also the results, of the functions (a little bit like dask do with cachey or chest, or like spark does with its RDD objects) and
  • one "delayed" decorator that permits to execute functions on a cluster (SLURM). In practice the API of functions is modified in order that they take jobids of dependencies as arguments and return the jobid of the created job on the cluster. Also the functions are serialized in a text file "launch.py" wich is launched with the cluster's command line API.

The task-name/job-id association is saved in a JSON file, which makes it possible to manage persistency using the task statuses returned by the cluster. Working this way gives a kind of persistency for the graph, and it makes it easy to debug tasks that failed. The serialization mechanism also gives easy access to all intermediate results, even without the whole workflow and/or the functions that generated them. Finally, it makes it easy to interact with legacy applications that do not use this kind of dataflow mechanism.
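
For illustration, here is a sketch of how that JSON file might be read and queried to decide whether a task needs to be rerun. The file name, the helper names, and the `sacct` invocation are assumptions; only the general idea (map task names to job ids, ask SLURM for the job state) comes from the description above.

```python
import json
import subprocess

STATE_FILE = "jobids.json"  # hypothetical map of task name -> SLURM job id

def load_state():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_state(state):
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)

def job_status(jobid):
    # `sacct` reports the state of a SLURM job; -X restricts the output
    # to the job allocation itself rather than individual job steps.
    out = subprocess.check_output(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "-X"],
        text=True,
    )
    return out.strip()

def needs_rerun(taskname, state):
    """A task must be (re)submitted if it was never run or did not finish."""
    jobid = state.get(taskname)
    return jobid is None or job_status(jobid) != "COMPLETED"
```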

This solution is certainly a little naive compared to other, more modern ways of executing distributed workflows with dask and distributed, but it seems to me to have some advantages due to its persistency (of both tasks and data).

I am interested to know whether this solution seems pertinent, and whether it describes an interesting use case that dask does not yet address.

If someone can recommend other ways to do this, I am also interested!

Source: https://stackoverflow.com/questions/42988110/persistent-dataflows-with-dask
