I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel.
Say I have some sales data keyed by a customerKey column.
You can set your column to be the index
df = df.set_index('customerKey')
This will sort your data by that column and track which ranges of values land in which partition. As you note, this is likely to be an expensive operation, so you'll probably want to save the result somewhere
Either in memory
df = df.persist()
or on disk
df.to_parquet('...')
df = dd.read_parquet('...')  # dd is the dask.dataframe module; read_parquet is not a DataFrame method
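
Putting it together, here is a minimal sketch of the whole round trip. The sample values, npartitions=2, the 'sales_by_customer.parquet' path, and the .loc[2] lookup are all made-up placeholders for illustration; swap in your real data and paths.

import pandas as pd
import dask.dataframe as dd

# Hypothetical sales data; replace with however you actually load yours.
pdf = pd.DataFrame({
    'customerKey': [1, 1, 2, 2, 3],
    'amount': [10.0, 15.0, 7.5, 3.0, 22.0],
})
df = dd.from_pandas(pdf, npartitions=2)

# Sort by customerKey and record which key ranges live in which partition.
df = df.set_index('customerKey')

# Keep the shuffled result in memory so the expensive sort isn't repeated.
df = df.persist()

# With customerKey as the index and known divisions, pulling one customer's
# rows only touches the partition holding that key range.
print(df.loc[2].compute())

# Or save the sorted result to disk and load it back in a later session.
df.to_parquet('sales_by_customer.parquet')
df = dd.read_parquet('sales_by_customer.parquet')

Once the index is in place, per-customer selections and similar operations can be scheduled partition by partition in parallel instead of scanning the whole dataset.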