dask

Difference between dask.distributed LocalCluster with threads vs. processes

Submitted by 旧巷老猫 on 2021-02-07 06:46:10
Question: What is the difference between the following LocalCluster configurations for dask.distributed?

    Client(n_workers=4, processes=False, threads_per_worker=1)

versus

    Client(n_workers=1, processes=True, threads_per_worker=4)

They both have four threads working on the task graph, but the first has four workers. What, then, would be the benefit of having multiple workers acting as threads, as opposed to a single worker with multiple threads? Edit: just a clarification, I'm aware of the difference
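A minimal sketch of the two setups, assuming a single machine; the array workload and the cluster sizes are illustrative placeholders, not part of the original question:

    from dask.distributed import Client
    import dask.array as da

    # Option 1: four workers in a single process, one thread each.
    # All workers share memory and a single GIL; no inter-process serialization.
    client = Client(n_workers=4, processes=False, threads_per_worker=1)

    # Option 2 (use instead of the above): one worker process with four threads.
    # client = Client(n_workers=1, processes=True, threads_per_worker=4)

    # Either way, four threads execute tasks from the graph.
    x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
    print(x.mean().compute())

    client.close()

In broad terms, threads suit GIL-releasing work (numpy, pandas internals), while separate worker processes help when tasks hold the GIL in pure-Python code.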

Dask dataframe split partitions based on a column or function

Submitted by 做~自己de王妃 on 2021-02-06 20:48:40
Question: I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel. Say I have some sales data like this:

    customerKey  productKey  transactionKey  grossSales  netSales  unitVolume  volume  transactionDate
    20353        189         219548          0.921058    0.921058  1           1       2017-02-01 00:00:00
    2596618      189         215015          0.709997    0.709997  1           1       2017-02-01 00:00:00
    30339435     189         215184          0
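A sketch of one way to split partitions on a column, assuming the data lives in CSV files; the file pattern and the per-customer aggregation are placeholders:

    import dask.dataframe as dd

    ddf = dd.read_csv("sales-*.csv", parse_dates=["transactionDate"])

    # set_index shuffles and sorts the data so that all rows sharing a customerKey
    # end up in the same partition, with partition boundaries following the key.
    ddf = ddf.set_index("customerKey")

    # Each partition can now be processed independently, e.g. a per-customer sum.
    per_customer = ddf.map_partitions(
        lambda pdf: pdf.groupby(level=0)["netSales"].sum()
    )
    print(per_customer.compute())

The shuffle behind set_index is not free, but once it is done, groupby-apply style operations on the index column no longer need to move data between partitions.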

Efficient way to read 15M-line CSV files in Python

Submitted by 泄露秘密 on 2021-02-05 18:54:07
Question: For my application, I need to read multiple files with 15M lines each, store them in a DataFrame, and save the DataFrame in HDF5 format. I've already tried different approaches, notably pandas.read_csv with chunksize and dtype specifications, and dask.dataframe. They both take around 90 seconds to process one file, so I'd like to know if there's a way to handle these files efficiently in the described way. In the following, I show some code for the tests I've done.

    import pandas as pd
    import
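A hedged sketch of the chunked pandas route, writing straight to an HDF5 store; the file names, column names, and dtypes below are placeholders that would need to match the real data:

    import pandas as pd

    # Declaring dtypes up front avoids expensive type inference on 15M rows.
    dtypes = {"customerKey": "int64", "productKey": "int32",
              "grossSales": "float32", "netSales": "float32"}

    with pd.HDFStore("sales.h5", mode="w") as store:
        for chunk in pd.read_csv("big_file.csv",
                                 dtype=dtypes,
                                 chunksize=1_000_000):
            # Appending chunk by chunk keeps peak memory bounded.
            store.append("sales", chunk, data_columns=True)

If the bottleneck is CSV parsing itself rather than memory, a faster parser (for example pyarrow's CSV reader) or reading several files in parallel with dask.dataframe are the usual next steps.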

Create Dataframe from a nested dictionary

Submitted by 余生长醉 on 2021-02-05 07:22:25
Question: I am trying to create a dataframe from a list of values which has nested dictionaries. So this is my data:

    d = [{'user': 200, 'p_val': {'a': 10, 'b': 200}, 'f_val': {'a': 20, 'b': 300}, 'life': 8},
         {'user': 202, 'p_val': {'a': 100, 'b': 200}, 'f_val': {'a': 200, 'b': 300}, 'life': 8}]

I am trying to turn it into a dataframe as follows:

    user  new_col  f_val  p_val  life
    200   a        20     10     8
    200   b        300    200    8
    202   a        200    100    8
    202   b        300    200    8

I looked at other answers, none of them matched my requirement. The
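One possible pure-pandas approach, assuming the nested p_val and f_val dictionaries always share the same keys (here 'a' and 'b'):

    import pandas as pd

    d = [{'user': 200, 'p_val': {'a': 10, 'b': 200}, 'f_val': {'a': 20, 'b': 300}, 'life': 8},
         {'user': 202, 'p_val': {'a': 100, 'b': 200}, 'f_val': {'a': 200, 'b': 300}, 'life': 8}]

    # Expand each record into one row per nested key, carrying the scalar fields along.
    rows = [
        {"user": rec["user"], "new_col": key,
         "f_val": rec["f_val"][key], "p_val": rec["p_val"][key], "life": rec["life"]}
        for rec in d
        for key in rec["p_val"]
    ]
    df = pd.DataFrame(rows, columns=["user", "new_col", "f_val", "p_val", "life"])
    print(df)

This produces the four-row frame shown above; pd.json_normalize followed by a reshape would be another route if the nested keys vary between records.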

Parallelization on a Dask cluster

Submitted by 纵饮孤独 on 2021-01-29 10:28:50
Question: I'm looking for the best way to parallelize the following problem on a cluster. I have several files:

    folder/file001.csv
    folder/file002.csv
    ...
    folder/file100.csv

They are disjoint with respect to the key I want to group by, that is, if a set of keys is in file001.csv, none of those keys has an item in any other file. On one hand I can just run

    df = dd.read_csv("folder/*")
    df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')

But I'm wondering if there is a better/smarter way
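Since the key sets do not overlap across files, each file can be grouped on its own and no global shuffle is needed. A sketch of that idea with dask.delayed; the column name, the aggregation inside f, and the file pattern are placeholders standing in for the question's f and meta:

    import dask
    import pandas as pd

    def f(group):
        # Placeholder for the per-group function from the question.
        return group.sum(numeric_only=True)

    @dask.delayed
    def process_one(path):
        pdf = pd.read_csv(path)
        return pdf.groupby("key").apply(f)

    files = [f"folder/file{i:03d}.csv" for i in range(1, 101)]
    parts = dask.compute(*[process_one(p) for p in files], scheduler="processes")
    result = pd.concat(parts)

Because every groupby runs inside a single file's pandas DataFrame, the expensive shuffle that a global groupby-apply normally triggers is avoided entirely.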

Dask-distributed. How to get task key ID in the function being calculated?

Submitted by 旧街凉风 on 2021-01-29 00:58:18
Question: My computations with dask.distributed include the creation of intermediate files whose names include a UUID4 that identifies that chunk of work.

    pairs = '{}\n{}\n{}\n{}'.format(list1, list2, list3, ...)
    file_path = os.path.join(job_output_root, 'pairs',
                             'pairs-{}.txt'.format(str(uuid.uuid4()).replace('-', '')))
    file(file_path, 'wt').writelines(pairs)

At the same time, all tasks in the dask distributed cluster have unique keys. Therefore, it would be natural to use that key ID for the file name. Is it
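One frequently suggested pattern is to ask the worker which task it is currently running; this relies on worker internals that have changed between dask.distributed releases, so treat get_current_task() as an assumption to verify against the installed version rather than a stable public API:

    import os
    from dask.distributed import Client, get_worker

    def write_pairs(list1, list2, job_output_root="."):
        # The Worker object executing this task; get_current_task() returns the
        # key of the task running on this thread (version-dependent internal call).
        key = get_worker().get_current_task()
        safe_key = str(key).replace("/", "_").replace("'", "").replace("(", "").replace(")", "")
        file_path = os.path.join(job_output_root, "pairs", "pairs-{}.txt".format(safe_key))
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        with open(file_path, "wt") as fh:
            fh.write("{}\n{}\n".format(list1, list2))
        return file_path

    if __name__ == "__main__":
        client = Client(processes=False)
        future = client.submit(write_pairs, [1, 2], [3, 4])
        print(future.result())   # e.g. ./pairs/pairs-write_pairs-<hash>.txt
        client.close()

Unlike a fresh uuid4 on every run, the task key is deterministic for a given task, which also makes the file naming reproducible across retries.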
