How to find why a task fails in dask distributed?

Submitted by 半世苍凉 on 2019-12-24 01:04:11

Question


I am developing a distributed computing system using dask.distributed. Tasks that I submit to it with the Executor.map function sometimes fail, while others that seem identical run successfully.

Does the framework provide any means to diagnose problems?

Update: by failing I mean that the counter of failed tasks in the Bokeh web UI, provided by the scheduler, increases. The counter of finished tasks increases too.

The function that is run by Executor.map returns None. It communicates with a database, retrieves some rows from a table, performs calculations, and updates values.

I have more than 40,000 tasks in the map, so it is a bit tedious to study the logs.


Answer 1:


If a task fails, then any attempt to retrieve the result will raise the same error that occurred on the worker:

In [1]: from distributed import Client

In [2]: c = Client()

In [3]: def div(x, y):
   ...:     return x / y
   ...: 

In [4]: future = c.submit(div, 1, 0)

In [5]: future.result()
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero
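
With tens of thousands of futures, it can be easier to inspect failures programmatically than to call result() on each one. Below is a minimal sketch, assuming the futures returned by your map call are still in scope; Future.status, Future.exception() and Future.traceback() are part of the distributed API, and the division workload here is just a stand-in for your real function.

import traceback
from distributed import Client, wait

client = Client()  # or Client('scheduler-address:8786') for an existing cluster

# Stand-in for the real workload: fails for x == 0
futures = client.map(lambda x: 1 / x, range(-5, 5))

# Wait until every task has either finished or errored, then collect the failures
wait(futures)
failed = [f for f in futures if f.status == 'error']

for f in failed:
    print(f.key, repr(f.exception()))
    print(''.join(traceback.format_tb(f.traceback())))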

However, other things can go wrong. For example, you might not have the same software on your workers as on your client, your network might not let connections through, or any of the other things that happen in real-world networks might occur. To help diagnose these, there are a few options:

  1. You can use the web interface to track the progress of your tasks and workers
  2. You can start IPython kernels in the scheduler or workers to inspect them directly (see the sketch after this list)
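
A minimal sketch of both options, assuming an existing scheduler at the hypothetical address scheduler-address:8786. start_ipython_scheduler and start_ipython_workers are methods on the distributed Client, but check your version's documentation for the exact signatures; the qtconsole option requires the qtconsole package on the client machine.

from distributed import Client

client = Client('scheduler-address:8786')  # hypothetical address of an existing scheduler

# Option 1: the diagnostic web interface is usually served by the scheduler on port 8787,
# e.g. http://scheduler-address:8787/status

# Option 2: start IPython kernels inside the scheduler and the workers and attach
# consoles to them, so you can inspect their state directly
scheduler_kernel = client.start_ipython_scheduler(qtconsole=True)
worker_kernels = client.start_ipython_workers(qtconsole=True)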


Source: https://stackoverflow.com/questions/39647019/how-to-find-why-a-task-fails-in-dask-distributed
