dask

Difference between dask.distributed LocalCluster with threads vs. processes

Submitted by 旧巷老猫 on 2021-02-07 06:46:10
Question: What is the difference between the following LocalCluster configurations for dask.distributed?

    Client(n_workers=4, processes=False, threads_per_worker=1)

versus

    Client(n_workers=1, processes=True, threads_per_worker=4)

They both have four threads working on the task graph, but the first has four workers. What, then, would be the benefit of having multiple workers acting as threads, as opposed to a single worker with multiple threads? Edit: just a clarification, I'm aware of the difference
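A minimal sketch of the two setups, assuming a single machine; the array workload and the cluster sizes are illustrative placeholders, not part of the original question:

    from dask.distributed import Client
    import dask.array as da

    # Option 1: four workers in a single process, one thread each.
    # All workers share memory and a single GIL; no inter-process serialization.
    client = Client(n_workers=4, processes=False, threads_per_worker=1)

    # Option 2 (use instead of the above): one worker process with four threads.
    # client = Client(n_workers=1, processes=True, threads_per_worker=4)

    # Either way, four threads execute tasks from the graph.
    x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
    print(x.mean().compute())

    client.close()

In broad terms, threads suit GIL-releasing work (numpy, pandas internals), while separate worker processes help when tasks hold the GIL in pure-Python code.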

Dask dataframe split partitions based on a column or function

Submitted by 做~自己de王妃 on 2021-02-06 20:48:40
Question: I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel. Say I have some sales data like this:

    customerKey  productKey  transactionKey  grossSales  netSales  unitVolume  volume  transactionDate
    20353        189         219548          0.921058    0.921058  1           1       2017-02-01 00:00:00
    2596618      189         215015          0.709997    0.709997  1           1       2017-02-01 00:00:00
    30339435     189         215184          0
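A sketch of one way to split partitions on a column, assuming the data lives in CSV files; the file pattern and the per-customer aggregation are placeholders:

    import dask.dataframe as dd

    ddf = dd.read_csv("sales-*.csv", parse_dates=["transactionDate"])

    # set_index shuffles and sorts the data so that all rows sharing a customerKey
    # end up in the same partition, with partition boundaries following the key.
    ddf = ddf.set_index("customerKey")

    # Each partition can now be processed independently, e.g. a per-customer sum.
    per_customer = ddf.map_partitions(
        lambda pdf: pdf.groupby(level=0)["netSales"].sum()
    )
    print(per_customer.compute())

The shuffle behind set_index is not free, but once it is done, groupby-apply style operations on the index column no longer need to move data between partitions.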

Efficient way to read 15M-line CSV files in Python

Submitted by 泄露秘密 on 2021-02-05 18:54:07
Question: For my application, I need to read multiple files with 15M lines each, store them in a DataFrame, and save the DataFrame in HDF5 format. I've already tried different approaches, notably pandas.read_csv with chunksize and dtype specifications, and dask.dataframe. They both take around 90 seconds to process one file, so I'd like to know if there's a way to handle these files efficiently in the described way. In the following, I show some code for the tests I've done.

    import pandas as pd
    import
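A hedged sketch of the chunked pandas route, writing straight to an HDF5 store; the file names, column names, and dtypes below are placeholders that would need to match the real data:

    import pandas as pd

    # Declaring dtypes up front avoids expensive type inference on 15M rows.
    dtypes = {"customerKey": "int64", "productKey": "int32",
              "grossSales": "float32", "netSales": "float32"}

    with pd.HDFStore("sales.h5", mode="w") as store:
        for chunk in pd.read_csv("big_file.csv",
                                 dtype=dtypes,
                                 chunksize=1_000_000):
            # Appending chunk by chunk keeps peak memory bounded.
            store.append("sales", chunk, data_columns=True)

If the bottleneck is CSV parsing itself rather than memory, a faster parser (for example pyarrow's CSV reader) or reading several files in parallel with dask.dataframe are the usual next steps.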

Create Dataframe from a nested dictionary

Submitted by 余生长醉 on 2021-02-05 07:22:25
Question: I am trying to create a dataframe from a list of values which has nested dictionaries. So this is my data:

    d = [{'user': 200, 'p_val': {'a': 10, 'b': 200}, 'f_val': {'a': 20, 'b': 300}, 'life': 8},
         {'user': 202, 'p_val': {'a': 100, 'b': 200}, 'f_val': {'a': 200, 'b': 300}, 'life': 8}]

I am trying to turn it into a dataframe as follows:

    user  new_col  f_val  p_val  life
    200   a        20     10     8
    200   b        300    200    8
    202   a        200    100    8
    202   b        300    200    8

I looked at other answers, none of them matched my requirement. The
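One possible pure-pandas approach, assuming the nested p_val and f_val dictionaries always share the same keys (here 'a' and 'b'):

    import pandas as pd

    d = [{'user': 200, 'p_val': {'a': 10, 'b': 200}, 'f_val': {'a': 20, 'b': 300}, 'life': 8},
         {'user': 202, 'p_val': {'a': 100, 'b': 200}, 'f_val': {'a': 200, 'b': 300}, 'life': 8}]

    # Expand each record into one row per nested key, carrying the scalar fields along.
    rows = [
        {"user": rec["user"], "new_col": key,
         "f_val": rec["f_val"][key], "p_val": rec["p_val"][key], "life": rec["life"]}
        for rec in d
        for key in rec["p_val"]
    ]
    df = pd.DataFrame(rows, columns=["user", "new_col", "f_val", "p_val", "life"])
    print(df)

This produces the four-row frame shown above; pd.json_normalize followed by a reshape would be another route if the nested keys vary between records.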

Parallelization on a Dask cluster

Submitted by 纵饮孤独 on 2021-01-29 10:28:50
Question: I'm looking for the best way to parallelize the following problem on a cluster. I have several files:

    folder/file001.csv
    folder/file002.csv
    ...
    folder/file100.csv

They are disjoint with respect to the key I want to group by, that is, if a set of keys is in file001.csv, none of those keys has an item in any other file. On one hand I can just run

    df = dd.read_csv("folder/*")
    df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')

But I'm wondering if there is a better/smarter way
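Since the key sets do not overlap across files, each file can be grouped on its own and no global shuffle is needed. A sketch of that idea with dask.delayed; the column name, the aggregation inside f, and the file pattern are placeholders standing in for the question's f and meta:

    import dask
    import pandas as pd

    def f(group):
        # Placeholder for the per-group function from the question.
        return group.sum(numeric_only=True)

    @dask.delayed
    def process_one(path):
        pdf = pd.read_csv(path)
        return pdf.groupby("key").apply(f)

    files = [f"folder/file{i:03d}.csv" for i in range(1, 101)]
    parts = dask.compute(*[process_one(p) for p in files], scheduler="processes")
    result = pd.concat(parts)

Because every groupby runs inside a single file's pandas DataFrame, the expensive shuffle that a global groupby-apply normally triggers is avoided entirely.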

Dask-distributed. How to get task key ID in the function being calculated?

Submitted by 旧街凉风 on 2021-01-29 00:58:18
Question: My computations with dask.distributed include the creation of intermediate files whose names include a UUID4 that identifies that chunk of work.

    pairs = '{}\n{}\n{}\n{}'.format(list1, list2, list3, ...)
    file_path = os.path.join(job_output_root, 'pairs',
                             'pairs-{}.txt'.format(str(uuid.uuid4()).replace('-', '')))
    file(file_path, 'wt').writelines(pairs)

At the same time, all tasks in the dask distributed cluster have unique keys. Therefore, it would be natural to use that key ID for the file name. Is it
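One frequently suggested pattern is to ask the worker which task it is currently running; this relies on worker internals that have changed between dask.distributed releases, so treat get_current_task() as an assumption to verify against the installed version rather than a stable public API:

    import os
    from dask.distributed import Client, get_worker

    def write_pairs(list1, list2, job_output_root="."):
        # The Worker object executing this task; get_current_task() returns the
        # key of the task running on this thread (version-dependent internal call).
        key = get_worker().get_current_task()
        safe_key = str(key).replace("/", "_").replace("'", "").replace("(", "").replace(")", "")
        file_path = os.path.join(job_output_root, "pairs", "pairs-{}.txt".format(safe_key))
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        with open(file_path, "wt") as fh:
            fh.write("{}\n{}\n".format(list1, list2))
        return file_path

    if __name__ == "__main__":
        client = Client(processes=False)
        future = client.submit(write_pairs, [1, 2], [3, 4])
        print(future.result())   # e.g. ./pairs/pairs-write_pairs-<hash>.txt
        client.close()

Unlike a fresh uuid4 on every run, the task key is deterministic for a given task, which also makes the file naming reproducible across retries.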
