dask-delayed

Generating batches of images in dask

心已入冬 submitted on 2019-12-10 22:48:09
Question: I just started with dask because it offers great parallel-processing power. I have around 40,000 images on my disk which I am going to use for building a classifier with some DL library, say Keras or TF. I collected this meta-info (image path and corresponding label) in a pandas dataframe, which looks like this:

       img_path    labels
    0  data/1.JPG  1
    1  data/2.JPG  1
    2  data/3.JPG  5
    ...

Now here is my simple task: use dask to read images and corresponding labels in a lazy fashion. Do some processing on…
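A minimal sketch of that lazy-loading step, assuming the paths in df["img_path"] point to fixed-size 256x256 RGB images and using skimage.io.imread as the reader (both are my assumptions, not stated in the question):

```python
# Lazy image loading with dask.delayed + dask.array (illustrative sketch).
import pandas as pd
import numpy as np
import dask
import dask.array as da
from skimage.io import imread

# Stand-in for the meta-info dataframe from the question.
df = pd.DataFrame({"img_path": ["data/1.JPG", "data/2.JPG", "data/3.JPG"],
                   "labels": [1, 1, 5]})

@dask.delayed
def load_image(path):
    # Nothing is read from disk until the graph is computed.
    return imread(path)

# Assumed image shape/dtype: 256x256 RGB uint8.
lazy_images = [
    da.from_delayed(load_image(p), shape=(256, 256, 3), dtype=np.uint8)
    for p in df["img_path"]
]
images = da.stack(lazy_images, axis=0)   # one lazy array, one image per row
labels = df["labels"].to_numpy()

batch = images[:32].compute()            # materialize a single batch on demand
```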

Unpacking result of delayed function

早过忘川 submitted on 2019-12-08 17:15:13
Question: While converting my program to use delayed, I stumbled upon a commonly used programming pattern that doesn't work with delayed. Example:

    from dask import delayed

    @delayed
    def myFunction():
        return 1, 2

    a, b = myFunction()
    a.compute()

This raises:

    TypeError: Delayed objects of unspecified length are not iterable

The following workaround does not raise, but it looks a lot clumsier:

    from dask import delayed

    @delayed
    def myFunction():
        return 1, 2

    dummy = myFunction()
    a, b = dummy[0], dummy[1]
    a.compute()
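Worth noting: dask's delayed also accepts an nout argument that declares how many values the wrapped function returns, which makes the result unpackable in the usual way (this option is not shown in the excerpt above, but it is part of dask's documented delayed API):

```python
from dask import delayed

# Declaring the number of outputs lets dask hand back an unpackable result.
@delayed(nout=2)
def myFunction():
    return 1, 2

a, b = myFunction()              # no TypeError; a and b are separate Delayed objects
print(a.compute(), b.compute())  # -> 1 2
```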

File Not Found Error in Dask program run on cluster

蹲街弑〆低调 submitted on 2019-12-06 15:12:20
Question: I have 4 machines, M1, M2, M3, and M4. The scheduler, client, and a worker run on M1, and I've put a CSV file on M1; the rest of the machines are workers. When I run the program with read_csv in dask, it gives me an error: file not found.

Answer 1: When one of your workers tries to load the CSV, it will not be able to find it, because it is not present on that local disc. This should not be a surprise. You can get around this in a number of ways:

- copy the file to every worker; this is obviously wasteful in terms of disc space, but the easiest to achieve
- place the file on a networked filesystem (NFS mount, gluster, …
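A further option, not part of the truncated answer above, is to read the CSV once on M1 (where it actually exists) and let dask distribute the resulting partitions to the workers. A rough sketch; the scheduler address, file path, and partition count are placeholders:

```python
# Read the CSV on the one machine that has it, then push the data to the cluster.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://M1-address:8786")         # scheduler running on M1 (placeholder)

local_df = pd.read_csv("/path/on/M1/data.csv")   # only M1 needs to see this path
ddf = dd.from_pandas(local_df, npartitions=8)
ddf = ddf.persist()                              # partitions now live on the workers

print(ddf.head())
```

This only helps if the file fits in memory on M1; for larger files, a shared or networked filesystem (or object storage) is the usual route.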

Sorting in Dask

≯℡__Kan透↙ submitted on 2019-12-05 13:38:46
Question: I want to find an alternative to the pandas.DataFrame.sort_values function in dask. I came across set_index, but it sorts on a single column. How can I sort a Dask dataframe by multiple columns?

Answer 1: So far Dask does not seem to support sorting by multiple columns. However, making a new column that concatenates the values of the sort columns may be a usable work-around:

    d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1)
    d = d.set_index('new_column')
    d = d.map_partitions(lambda x: x.sort_index())

Edit: The above works if you want to sort by two strings. I recommend creating integer…
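Filling that out into something runnable, a small self-contained version of the same workaround could look like this (the sample data, column names, and the meta argument are mine, not part of the answer):

```python
# Concatenated-key workaround for multi-column sorting, on toy data.
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"col1": [2, 1, 1, 2], "col2": ["b", "d", "a", "c"]})
d = dd.from_pandas(pdf, npartitions=2)

# Build one sortable key from the two columns, index on it, then sort each partition.
d["new_column"] = d.apply(
    lambda r: str([r.col1, r.col2]), axis=1, meta=("new_column", "object")
)
d = d.set_index("new_column")
d = d.map_partitions(lambda x: x.sort_index())

print(d.compute())
```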

convert dask.bag of dictionaries to dask.dataframe using dask.delayed and pandas.DataFrame

眉间皱痕 submitted on 2019-12-05 01:28:03
Question: I am struggling to convert a dask.bag of dictionaries, via dask.delayed pandas.DataFrames, into a final dask.dataframe. I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df) that turns these dictionaries into a pandas.DataFrame (the resulting dataframe is around 100 MB per file). I would like to append all dataframes into a single dask.dataframe for further analysis. Up to now I was using dask.delayed objects to load, …
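The usual way to stitch such pieces together is dask.dataframe.from_delayed. A minimal sketch of the bag → delayed pandas.DataFrame → dask.dataframe pipeline, where make_dict and make_df are simple stand-ins for the question's real functions and the file list and meta frame are made up:

```python
import pandas as pd
import dask
import dask.bag as db
import dask.dataframe as dd

def make_dict(path):
    # Stand-in for the real parser that builds a nested dict from a file.
    return {"file": path, "value": len(path)}

@dask.delayed
def make_df(records):
    # Stand-in for the real dict -> DataFrame conversion; `records` is the
    # list of dicts held by one bag partition.
    return pd.DataFrame(records)

files = ["a.json", "b.json", "c.json", "d.json"]
bag = db.from_sequence(files, npartitions=2).map(make_dict)

# One delayed pandas.DataFrame per bag partition, combined into a dask.dataframe.
delayed_frames = [make_df(part) for part in bag.to_delayed()]
meta = pd.DataFrame({"file": pd.Series(dtype="object"),
                     "value": pd.Series(dtype="int64")})
ddf = dd.from_delayed(delayed_frames, meta=meta)

print(ddf.compute())
```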