Question
I am developing a distributed computing system using dask.distributed. Tasks that I submit to it with the Executor.map function sometimes fail, while others that seem identical run successfully. Does the framework provide any means to diagnose problems?
Update: by "failing" I mean that the counter of failed tasks in the Bokeh web UI provided by the scheduler increases (the counter of finished tasks increases too).
The function run by Executor.map returns None. It connects to a database, retrieves some rows from a table, performs calculations, and updates values.
I have more than 40000 tasks in the map, so it is a bit tedious to study the logs.
Answer 1:
If a task fails, then any attempt to retrieve the result will raise the same error that occurred on the worker:
In [1]: from distributed import Client

In [2]: c = Client()

In [3]: def div(x, y):
   ...:     return x / y
   ...:

In [4]: future = c.submit(div, 1, 0)

In [5]: future.result()
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero
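If you would rather inspect the error than have it re-raised, a future's exception() method returns the exception object instead of raising it. The sketch below uses the standard-library concurrent.futures, whose Future API the dask Executor interface mirrors (distributed futures additionally offer a traceback() method); div is the same example function as above:

```python
from concurrent.futures import ThreadPoolExecutor

def div(x, y):
    return x / y

with ThreadPoolExecutor(max_workers=1) as ex:
    future = ex.submit(div, 1, 0)
    # exception() waits for the task and returns the error instead of raising
    err = future.exception()

print(type(err).__name__, err)  # → ZeroDivisionError division by zero
```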
However, other things can go wrong. For example, you might not have the same software on your workers as on your client, your network might not let connections through, or any of the other things that happen in real-world networks might occur. To help diagnose these there are a few options:
- You can use the web interface to track the progress of your tasks and workers
- You can start IPython kernels in the scheduler or workers to inspect them directly
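With tens of thousands of tasks, checking futures one at a time is impractical. One way to triage is to collect the failed futures as they complete and tally the exception types, so a single recurring cause stands out. This is a sketch using the standard-library concurrent.futures (the same pattern the dask Executor interface follows); the inputs with y == 0 stand in for the "seemingly identical" tasks that fail:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def div(x, y):
    return x / y

with ThreadPoolExecutor(max_workers=4) as ex:
    # Two of the five inputs divide by zero, mimicking sporadic failures
    futures = [ex.submit(div, 1, y) for y in (1, 2, 0, 4, 0)]
    failed = [f for f in as_completed(futures) if f.exception() is not None]

# Tally error types instead of reading thousands of log entries
counts = Counter(type(f.exception()).__name__ for f in failed)
print(counts)  # → Counter({'ZeroDivisionError': 2})
```

With distributed futures, the equivalent filter is futures whose status attribute is 'error', and each one's traceback() gives the remote stack trace.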
Source: https://stackoverflow.com/questions/39647019/how-to-find-why-a-task-fails-in-dask-distributed