Question
I have a setup as follows:
# etl.py
from dask.distributed import Client
import dask

from tasks import task1, task2, task3

def runall(**kwargs):
    print("done")

def etl():
    client = Client()

    tasks = {}
    tasks['task1'] = dask.delayed(task1)(*args)   # args elided in the original
    tasks['task2'] = dask.delayed(task2)(*args)
    tasks['task3'] = dask.delayed(task3)(*args)

    out = dask.delayed(runall)(**tasks)
    out.compute()
This logic was borrowed from Luigi and works nicely with if statements to control which tasks to run, as sketched below.
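For illustration, a minimal sketch of that gating pattern, not the asker's actual code (the run_task2 flag and the zero-argument task calls are hypothetical):

def etl(run_task2=True):
    client = Client()
    tasks = {}
    tasks['task1'] = dask.delayed(task1)()
    if run_task2:  # hypothetical flag deciding whether task2 joins the graph
        tasks['task2'] = dask.delayed(task2)()
    out = dask.delayed(runall)(**tasks)
    out.compute()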
However, some of the tasks load large amounts of data from SQL and cause GIL-freeze warnings (at least this is my suspicion, as it is hard to diagnose exactly which line causes the issue). Sometimes the graph/monitoring dashboard on port 8787 shows nothing, just an empty scheduler; I suspect this is caused by the app freezing dask. What is the best way to load large amounts of data from SQL (MSSQL and Oracle) in dask? At the moment this is done with SQLAlchemy with tuned settings. Would adding async and await help?
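One commonly suggested option for this kind of load is to let dask partition the read itself with dask.dataframe.read_sql_table, which issues one bounded query per partition across the workers rather than one huge client-side fetch. A minimal sketch, assuming a pyodbc-backed MSSQL connection and a numeric indexed column; the URI, table, and column names here are placeholders:

import dask.dataframe as dd

# Placeholder SQLAlchemy URI; substitute your MSSQL or Oracle connection.
uri = "mssql+pyodbc://user:password@my_dsn"

df = dd.read_sql_table(
    "my_table",        # hypothetical table name
    uri,
    index_col="id",    # an indexed numeric column to partition on
    npartitions=40,    # e.g. one partition per core on a 40-core box
)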
However, some of the tasks are a bit slow and I'd like to use things like dask.dataframe or bag internally. The docs advise against calling delayed inside delayed. Does this also hold for dataframe and bag? The entire script runs on a single 40-core machine.
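The alternative that the dask.distributed documentation describes for computing a collection from inside a task is worker_client, which gives the task a client that can submit further work without deadlocking its worker thread. A minimal sketch (the bag contents are made up):

import dask.bag as db
from dask.distributed import worker_client

def slow_task():
    # Inside a task already running on a worker, worker_client() yields
    # a client that can safely drive nested dask collections.
    with worker_client() as client:
        bag = db.from_sequence(range(1000), npartitions=8).map(lambda x: x * 2)
        return client.compute(bag).result()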
Using bag.starmap I get a graph like this:

[dashboard task-graph screenshot omitted]

where the upper straight lines are added/discovered once the computation reaches that task and compute is called inside it.
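For reference, bag.starmap applies a function to each argument tuple in the bag; a small, self-contained illustration (the data here is made up):

import dask.bag as db

# starmap unpacks each tuple into the function's positional arguments.
pairs = db.from_sequence([(1, 2), (3, 4), (5, 6)], npartitions=2)
sums = pairs.starmap(lambda a, b: a + b)
print(sums.compute())  # [3, 7, 11]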
Answer 1:
There appears to be no rhyme or reason beyond the machine being very busy and struggling to render the state updates and bokeh plots as expected.
Source: https://stackoverflow.com/questions/64911735/dask-scheduler-empty-graph-not-showing