dask

dask handle delayed failures

Submitted by 青春壹個敷衍的年華 on 2020-12-16 02:21:31
Question: How can I port the following function to dask in order to parallelize it?

    from time import sleep

    from dask.distributed import Client
    from dask import delayed

    client = Client(n_workers=4)

    from tqdm import tqdm
    tqdm.pandas()

    # linear
    things = [1, 2, 3]
    _x = []
    _y = []

    def my_slow_function(foo):
        sleep(2)
        x = foo
        y = 2 * foo
        assert y < 5
        return x, y

    for foo in tqdm(things):
        try:
            x_v, y_v = my_slow_function(foo)
            _x.append(x_v)
            if y_v is not None:
                _y.append(y_v)
        except AssertionError:
            print(f'failed: {foo}')
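
One way to run this loop in parallel (a minimal sketch, not a confirmed answer from the thread): submit each call as a distributed future; fut.result() re-raises the worker-side AssertionError locally, so the original try/except pattern carries over almost unchanged.

    # Sketch: each call becomes a future; result() re-raises remote errors.
    from time import sleep
    from dask.distributed import Client

    def my_slow_function(foo):
        sleep(2)
        y = 2 * foo
        assert y < 5
        return foo, y

    if __name__ == '__main__':
        client = Client(n_workers=4)
        things = [1, 2, 3]
        futures = [client.submit(my_slow_function, foo) for foo in things]
        _x, _y = [], []
        for foo, fut in zip(things, futures):
            try:
                x_v, y_v = fut.result()
                _x.append(x_v)
                if y_v is not None:
                    _y.append(y_v)
            except AssertionError:
                print(f'failed: {foo}')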

Dask scheduler empty / graph not showing

Submitted by £可爱£侵袭症+ on 2020-12-15 06:40:00
Question: I have a setup as follows:

    # etl.py
    from dask.distributed import Client
    import dask
    from tasks import task1, task2, task3

    def runall(**kwargs):
        print("done")

    def etl():
        client = Client()
        tasks = {}
        tasks['task1'] = dask.delayed(task1)(*args)
        tasks['task2'] = dask.delayed(task2)(*args)
        tasks['task3'] = dask.delayed(task3)(*args)
        out = dask.delayed(runall)(**tasks)
        out.compute()

This logic was borrowed from luigi and works nicely with if statements to control which tasks to run. However, some of
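
A hedged sketch of one way to confirm the graph is actually reaching the scheduler (not necessarily the thread's accepted fix): keep the client in scope, print its dashboard link before computing, and render the graph locally with visualize (which requires graphviz).

    # Sketch: expose the dashboard URL and draw the graph before computing.
    import dask
    from dask.distributed import Client

    def task1():
        return 1

    def runall(**kwargs):
        print("done")

    if __name__ == '__main__':
        client = Client()
        print(client.dashboard_link)  # open this URL, then trigger compute
        out = dask.delayed(runall)(task1=dask.delayed(task1)())
        out.visualize(filename='graph.svg')  # needs graphviz installed
        out.compute()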

How do the batching instructions of Dask delayed best practices work?

Submitted by 白昼怎懂夜的黑 on 2020-12-15 06:16:59
Question: I guess I'm missing something (still a Dask noob), but I'm trying the batching suggestion to avoid too many Dask tasks from here: https://docs.dask.org/en/latest/delayed-best-practices.html and can't make it work. This is what I tried:

    import dask

    def f(x):
        return x*x

    def batch(seq):
        sub_results = []
        for x in seq:
            sub_results.append(f(x))
        return sub_results

    batches = []
    for i in range(0, 1000000000, 1000000):
        result_batch = dask.delayed(batch, range(i, i + 1000000))
        batches.append(result_batch)
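
For reference, the pattern on that best-practices page hinges on two separate call sites: dask.delayed(batch) wraps the function, and the following (...) call builds the task with its argument. A minimal sketch of that reading, with a smaller range so it finishes quickly:

    # Sketch: delayed(batch) wraps the function; (...) builds the task.
    import dask

    def f(x):
        return x * x

    def batch(seq):
        return [f(x) for x in seq]

    batches = []
    for i in range(0, 10_000_000, 1_000_000):  # smaller range for the sketch
        batches.append(dask.delayed(batch)(range(i, i + 1_000_000)))

    results = dask.compute(*batches)  # tuple of lists, one per batch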

Simple way to Dask concatenate (horizontal, axis=1, columns)

Submitted by 雨燕双飞 on 2020-12-04 18:12:36
Question

Action: Reading two csv files (data.csv and label.csv) into a single dataframe.

    df = dd.read_csv(data_files, delimiter=' ', header=None,
                     names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b'])
    df_label = dd.read_csv(label_files, delimiter=' ', header=None,
                           names=['label'])

Problem: Concatenation of columns requires known divisions. However, setting an index will sort the data, which I explicitly do not want, because the row order of the two files is what matches them up.

    df = dd.concat([df, df_label], axis=1)
    ------------------
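
One workaround (a sketch under the assumption that data_files and label_files are matched, equal-length lists and each file pair fits in memory; not the thread's confirmed answer): do the horizontal concat per file pair in pandas inside a delayed task, then rebuild a dask dataframe from the parts, so no index or sort is ever needed.

    # Sketch: concat each matched file pair with pandas, then reassemble.
    import pandas as pd
    import dask
    import dask.dataframe as dd

    cols = ['x', 'y', 'z', 'intensity', 'r', 'g', 'b']

    @dask.delayed
    def read_pair(data_file, label_file):
        data = pd.read_csv(data_file, delimiter=' ', header=None, names=cols)
        label = pd.read_csv(label_file, delimiter=' ', header=None,
                            names=['label'])
        return pd.concat([data, label], axis=1)  # row order is preserved

    parts = [read_pair(d, l) for d, l in zip(data_files, label_files)]
    df = dd.from_delayed(parts)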

Writing xarray multiindex data in chunks

Submitted by 隐身守侯 on 2020-12-02 06:50:40
Question: I am trying to efficiently restructure a large multidimensional dataset. Assume I have a number of remotely sensed images over time with a number of bands, with coordinates x and y for pixel location, time for the time of image acquisition, and band for the different data collected. In my use case, assume the xarray coord lengths are roughly x (3000), y (3000), and time (10), with bands (40) of floating point data, so 100 GB+ of data. I have been trying to work from this example but I am having trouble
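
For scale, a minimal sketch with the shapes from the question (the variable name reflectance and the chunk sizes are illustrative assumptions): build a dask-backed DataArray of that size and write it out chunk by chunk with zarr, so the full array never has to sit in memory.

    # Sketch: a dask-backed array of the stated shape, written chunkwise.
    import numpy as np
    import xarray as xr
    import dask.array as da

    data = da.random.random((3000, 3000, 10, 40), chunks=(1000, 1000, 10, 1))
    arr = xr.DataArray(
        data,
        dims=('x', 'y', 'time', 'band'),
        coords={'x': np.arange(3000), 'y': np.arange(3000),
                'time': np.arange(10), 'band': np.arange(40)},
        name='reflectance',  # hypothetical variable name
    )
    arr.to_dataset().to_zarr('restructured.zarr', mode='w')  # needs zarr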
