Access a single element in large published array with Dask

一个人想着一个人 提交于 2019-12-11 06:37:53

问题


Is there a faster way to only retrieve a single element in a large published array with Dask without retrieving the entire array?

In the example below client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').

import distributed
client = distributed.Client()
data = [1]*10000000
payload = {'array1': data}
client.publish(**payload)

one_element = client.get_dataset('array1')[0]

回答1:


Note that anything you publish goes to the scheduler, not to the workers, so this is somewhat inefficient. Publish was intended to be used with Dask collections like dask.array.

Client 1

import dask.array as da
x = da.ones(10000000, chunks=(100000,))  # 1e7 size array cut into 1e5 size chunks
x = x.persist()  # persist array on the workers of the cluster

client.publish(x=x)  # store the metadata of x on the scheduler

Client 2

x = client.get_dataset('x')  # get the lazy collection x
x[0].compute()  # this selection happens on the worker, only the result comes down


来源:https://stackoverflow.com/questions/45225485/access-a-single-element-in-large-published-array-with-dask

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!