Question
I am developing a distributed computing system using dask.distributed. Tasks that I submit to it with the Executor.map function sometimes fail, while others that seem identical run successfully. Does the framework provide any means to diagnose problems?
Update: by "failing" I mean that the counter of failed tasks in the Bokeh web UI provided by the scheduler increases (the counter of finished tasks increases too).
The function run by Executor.map returns None. It connects to a database, retrieves some rows from a table, performs calculations, and updates values.
I have more than 40000 tasks in the map, so it is a bit tedious to study the logs.
Answer 1:
If a task fails, then any attempt to retrieve the result will raise the same error that occurred on the worker:
In [1]: from distributed import Client

In [2]: c = Client()

In [3]: def div(x, y):
   ...:     return x / y
   ...:

In [4]: future = c.submit(div, 1, 0)

In [5]: future.result()
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero
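If you would rather inspect the error than have it re-raised, a future's exception() method returns the exception object instead of raising it. The sketch below uses the standard-library concurrent.futures, whose Future API the dask Executor interface mirrors (distributed futures additionally offer a traceback() method); div is the same example function as above:

```python
from concurrent.futures import ThreadPoolExecutor

def div(x, y):
    return x / y

with ThreadPoolExecutor(max_workers=1) as ex:
    future = ex.submit(div, 1, 0)
    # exception() waits for the task and returns the error instead of raising
    err = future.exception()

print(type(err).__name__, err)  # → ZeroDivisionError division by zero
```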
However, other things can go wrong. For example, you might not have the same software on your workers as on your client, your network might not let connections through, or any of the other things that happen in real-world networks might occur. To help diagnose these there are a few options:
- You can use the web interface to track the progress of your tasks and workers
- You can start IPython kernels in the scheduler or workers to inspect them directly
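With tens of thousands of tasks, checking futures one at a time is impractical. One way to triage is to collect the failed futures as they complete and tally the exception types, so a single recurring cause stands out. This is a sketch using the standard-library concurrent.futures (the same pattern the dask Executor interface follows); the inputs with y == 0 stand in for the "seemingly identical" tasks that fail:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def div(x, y):
    return x / y

with ThreadPoolExecutor(max_workers=4) as ex:
    # Two of the five inputs divide by zero, mimicking sporadic failures
    futures = [ex.submit(div, 1, y) for y in (1, 2, 0, 4, 0)]
    failed = [f for f in as_completed(futures) if f.exception() is not None]

# Tally error types instead of reading thousands of log entries
counts = Counter(type(f.exception()).__name__ for f in failed)
print(counts)  # → Counter({'ZeroDivisionError': 2})
```

With distributed futures, the equivalent filter is futures whose status attribute is 'error', and each one's traceback() gives the remote stack trace.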
Source: https://stackoverflow.com/questions/39647019/how-to-find-why-a-task-fails-in-dask-distributed