dask

dask handle delayed failures

Submitted by 青春壹個敷衍的年華 on 2020-12-16 02:21:31
Question: How can I port the following function to dask in order to parallelize it?

    from time import sleep

    from dask.distributed import Client
    from dask import delayed

    client = Client(n_workers=4)

    from tqdm import tqdm
    tqdm.pandas()

    # linear
    things = [1, 2, 3]
    _x = []
    _y = []

    def my_slow_function(foo):
        sleep(2)
        x = foo
        y = 2 * foo
        assert y < 5
        return x, y

    for foo in tqdm(things):
        try:
            x_v, y_v = my_slow_function(foo)
            _x.append(x_v)
            if y_v is not None:
                _y.append(y_v)
        except AssertionError:
            print(f'failed: {foo}')
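
One way to run this loop in parallel (a minimal sketch, not a confirmed answer from the thread): submit each call as a distributed future; fut.result() re-raises the worker-side AssertionError locally, so the original try/except pattern carries over almost unchanged.

    # Sketch: each call becomes a future; result() re-raises remote errors.
    from time import sleep
    from dask.distributed import Client

    def my_slow_function(foo):
        sleep(2)
        y = 2 * foo
        assert y < 5
        return foo, y

    if __name__ == '__main__':
        client = Client(n_workers=4)
        things = [1, 2, 3]
        futures = [client.submit(my_slow_function, foo) for foo in things]
        _x, _y = [], []
        for foo, fut in zip(things, futures):
            try:
                x_v, y_v = fut.result()
                _x.append(x_v)
                if y_v is not None:
                    _y.append(y_v)
            except AssertionError:
                print(f'failed: {foo}')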

Dask scheduler empty / graph not showing

Submitted by £可爱£侵袭症+ on 2020-12-15 06:40:00
Question: I have a setup as follows:

    # etl.py
    from dask.distributed import Client
    import dask
    from tasks import task1, task2, task3

    def runall(**kwargs):
        print("done")

    def etl():
        client = Client()
        tasks = {}
        tasks['task1'] = dask.delayed(task1)(*args)
        tasks['task2'] = dask.delayed(task2)(*args)
        tasks['task3'] = dask.delayed(task3)(*args)
        out = dask.delayed(runall)(**tasks)
        out.compute()

This logic was borrowed from luigi and works nicely with if statements to control which tasks to run. However, some of
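
A hedged sketch of one way to confirm the graph is actually reaching the scheduler (not necessarily the thread's accepted fix): keep the client in scope, print its dashboard link before computing, and render the graph locally with visualize (which requires graphviz).

    # Sketch: expose the dashboard URL and draw the graph before computing.
    import dask
    from dask.distributed import Client

    def task1():
        return 1

    def runall(**kwargs):
        print("done")

    if __name__ == '__main__':
        client = Client()
        print(client.dashboard_link)  # open this URL, then trigger compute
        out = dask.delayed(runall)(task1=dask.delayed(task1)())
        out.visualize(filename='graph.svg')  # needs graphviz installed
        out.compute()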

How do the batching instructions of Dask delayed best practices work?

Submitted by 白昼怎懂夜的黑 on 2020-12-15 06:16:59
Question: I guess I'm missing something (still a Dask noob), but I'm trying the batching suggestion to avoid too many Dask tasks from here: https://docs.dask.org/en/latest/delayed-best-practices.html and can't make it work. This is what I tried:

    import dask

    def f(x):
        return x*x

    def batch(seq):
        sub_results = []
        for x in seq:
            sub_results.append(f(x))
        return sub_results

    batches = []
    for i in range(0, 1000000000, 1000000):
        result_batch = dask.delayed(batch, range(i, i + 1000000))
        batches.append(result_batch)
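
For reference, the pattern on that best-practices page hinges on two separate call sites: dask.delayed(batch) wraps the function, and the following (...) call builds the task with its argument. A minimal sketch of that reading, with a smaller range so it finishes quickly:

    # Sketch: delayed(batch) wraps the function; (...) builds the task.
    import dask

    def f(x):
        return x * x

    def batch(seq):
        return [f(x) for x in seq]

    batches = []
    for i in range(0, 10_000_000, 1_000_000):  # smaller range for the sketch
        batches.append(dask.delayed(batch)(range(i, i + 1_000_000)))

    results = dask.compute(*batches)  # tuple of lists, one per batch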

Simple way to Dask concatenate (horizontal, axis=1, columns)

Submitted by 雨燕双飞 on 2020-12-04 18:12:36
Question

Action: Reading two csv files (data.csv and label.csv) into a single dataframe.

    df = dd.read_csv(data_files, delimiter=' ', header=None,
                     names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
    df_label = dd.read_csv(label_files, delimiter=' ', header=None,
                           names=['label'])

Problem: Concatenation of columns requires known divisions. However, setting an index will sort the data, which I explicitly do not want, because the row order of the two files is what matches them up.

    df = dd.concat([df, df_label], axis=1)
    ------------------
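
One workaround (a sketch under the assumption that data_files and label_files are matched, equal-length lists and each file pair fits in memory; not the thread's confirmed answer): do the horizontal concat per file pair in pandas inside a delayed task, then rebuild a dask dataframe from the parts, so no index or sort is ever needed.

    # Sketch: concat each matched file pair with pandas, then reassemble.
    import pandas as pd
    import dask
    import dask.dataframe as dd

    cols = ['x', 'y', 'z', 'intensity', 'r', 'g', 'b']

    @dask.delayed
    def read_pair(data_file, label_file):
        data = pd.read_csv(data_file, delimiter=' ', header=None, names=cols)
        label = pd.read_csv(label_file, delimiter=' ', header=None,
                            names=['label'])
        return pd.concat([data, label], axis=1)  # row order is preserved

    parts = [read_pair(d, l) for d, l in zip(data_files, label_files)]
    df = dd.from_delayed(parts)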

Writing xarray multiindex data in chunks

Submitted by 隐身守侯 on 2020-12-02 06:50:40
Question: I am trying to efficiently restructure a large multidimensional dataset. Assume I have a number of remotely sensed images over time with a number of bands, with coordinates x and y for pixel location, time for the time of image acquisition, and band for the different data collected. In my use case, assume the xarray coord lengths are roughly x (3000), y (3000), and time (10), with bands (40) of floating point data, so 100 GB+ of data. I have been trying to work from this example but I am having trouble
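
For scale, a minimal sketch with the shapes from the question (the variable name reflectance and the chunk sizes are illustrative assumptions): build a dask-backed DataArray of that size and write it out chunk by chunk with zarr, so the full array never has to sit in memory.

    # Sketch: a dask-backed array of the stated shape, written chunkwise.
    import numpy as np
    import xarray as xr
    import dask.array as da

    data = da.random.random((3000, 3000, 10, 40), chunks=(1000, 1000, 10, 1))
    arr = xr.DataArray(
        data,
        dims=('x', 'y', 'time', 'band'),
        coords={'x': np.arange(3000), 'y': np.arange(3000),
                'time': np.arange(10), 'band': np.arange(40)},
        name='reflectance',  # hypothetical variable name
    )
    arr.to_dataset().to_zarr('restructured.zarr', mode='w')  # needs zarr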
